diff --git a/.gitignore b/.gitignore
index 81aad72480..cf134eb478 100644
--- a/.gitignore
+++ b/.gitignore
@@ -18,6 +18,7 @@ __pycache__
htmlcov
build/
build_prims/
+cmake-build*
cuml.egg-info/
dist/
python/cuml/**/*.cpp
@@ -29,6 +30,9 @@ log
dask-worker-space/
tmp/
+## files pickled in notebooks when run during python docstring generation
+docs/source/*.model
+
## eclipse
.project
.cproject
diff --git a/BUILD.md b/BUILD.md
index 96cb5ee868..8787c2b253 100644
--- a/BUILD.md
+++ b/BUILD.md
@@ -15,10 +15,17 @@ To install cuML from source, ensure the following dependencies are met:
9. NCCL (>=2.4)
10. UCX [optional] (>= 1.7) - enables point-to-point messaging in the cuML standard communicator. This is necessary for many multi-node multi-GPU cuML algorithms to function.
-It is recommended to use conda for environment/package management. If doing so, a convenience environment .yml file is located in `conda/environments/cuml_dec_cudax.y.yml` (replace x.y for your CUDA version). This file contains most of the dependencies mentioned above (notable exceptions are `gcc` and `zlib`). To use it, for example to create an environment named `cuml_dev` for CUDA 10.0 and Python 3.7, you can use the follow command:
+It is recommended to use conda for environment/package management. If doing so, a convenience environment .yml file is located in `conda/environments/cuml_dev_cudax.y.yml` (replace x.y with your CUDA version). This file contains most of the dependencies mentioned above (notable exceptions are `gcc` and `zlib`). To use it, for example to create an environment named `cuml_dev` for CUDA 10.2 and Python 3.7, you can use the following command:
+```bash
+conda create -n cuml_dev python=3.7
+conda env update -n cuml_dev --file=conda/environments/cuml_dev_cuda10.2.yml
```
-conda env create -n cuml_dev python=3.7 --file=conda/environments/cuml_dev_cuda10.0.yml
+
+These conda environments are based on the general RAPIDS meta packages that install common dependencies for RAPIDS projects. To install different versions of packages contained in those meta packages after creating the environment, it is recommended to remove those meta packages (without removing the actual packages contained in the environment) with the following command (with the environment activated):
+
+```bash
+conda remove --force rapids-build-env rapids-notebook-env rapids-doc-env
```
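+
+For example, a specific version of one of the contained packages can then be pinned directly (`your-pkg` is a placeholder, following the example in the [RAPIDS dependency management docs](https://docs.rapids.ai/maintainers/depmgmt/)):
+
+```bash
+conda install "your-pkg=1.0.0"
+```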

## Installing from Source:
diff --git a/CHANGELOG.md b/CHANGELOG.md
index b5f0f9238a..c2c9f83eeb 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -1,6 +1,145 @@
+# cuML 0.17.0 (Date TBD)
+
+## New Features
+
+## Improvements
+- PR #3070: Speed up dask/test_datasets tests
+- PR #3075: Speed up test_linear_model tests
+- PR #3078: Speed up test_incremental_pca tests
+- PR #2902: `matrix/matrix.cuh` in RAFT namespacing
+- PR #2903: Moving linalg's gemm, gemv, transpose to RAFT namespaces
+- PR #2905: `stats` prims `mean_center`, `sum` to RAFT namespaces
+- PR #2904: Moving `linalg` basic math ops to RAFT namespaces
+- PR #3000: Pin cmake policies to cmake 3.17 version, bump project version to 0.17
+- PR #3083: Improving test_make_blobs testing time
+- PR #2906: Moving `linalg` decomp to RAFT namespaces
+- PR #2996: Removing the max_depth restriction for switching to the batched backend
+- PR #3004: Remove Single Process Multi GPU (SPMG) code
+- PR #3044: Move leftover `linalg` and `stats` to RAFT namespaces
+- PR #3067: Deleting prims moved to RAFT and updating header paths
+- PR #3074: Reducing dask coordinate descent test runtime
+
+## Bug Fixes
+- PR #3072: Fusing metrics and score directories in src_prims
+- PR #3037: Avoid logging deadlock in multi-threaded C code
+- PR #2983: Fix seeding of KISS99 RNG
+- PR #3011: Fix unused initialize_embeddings parameter in Barnes-Hut t-SNE
+- PR #3008: Check number of columns in check_array validator
+- PR #3012: Increasing learning rate for SGD log loss and invscaling pytests
+- PR #3021: Fix a hang in cuML RF experimental backend
+- PR #3039: Update RF and decision tree parameter initializations in benchmark codes
+- PR #3061: Handle C++ exception thrown from FIL predict
+- PR #3073: Update mathjax CDN URL for documentation
+- PR #3062: Bumping xgboost version to match cuml version
+- PR #3086: Reverting FIL Notebook Testing
+
+# cuML 0.16.0 (Date TBD)
+
+## New Features
+- PR #2922: Install RAFT headers with cuML
+- PR #2909: Update allgatherv for compatibility with latest RAFT
+- PR #2677: Ability to export RF trees as JSON
+- PR #2698: Distributed TF-IDF transformer
+- PR #2476: Porter Stemmer
+- PR #2789: Dask LabelEncoder
+- PR #2152: add FIL C++ benchmark
+- PR #2638: Improve cython build with custom `build_ext`
+- PR #2866: Support XGBoost-style multiclass models (gradient boosted decision trees) in FIL C++
+- PR #2874: Issue warning for degraded accuracy with float64 models in Treelite
+- PR #2881: Introduces experimental batched backend for random forest
+- PR #2916: Add SKLearn multi-class GBDT model support in FIL
+
+## Improvements
+- PR #2947: Add more warnings for accuracy degradation with 64-bit models
+- PR #2873: Remove empty marker kernel code for NVTX markers
+- PR #2796: Remove tokens of length 1 by default for text vectorizers
+- PR #2741: Use rapids build packages in conda environments
+- PR #2735: Update seed to random_state in random forest and associated tests
+- PR #2739: Use cusparse_wrappers.h from RAFT
+- PR #2729: Replace `cupy.sparse` with `cupyx.scipy.sparse`
+- PR #2749: Correct docs for python version used in cuml_dev conda environment
+- PR #2747: Adopting raft::handle_t and raft::comms::comms_t in cuML
+- PR #2762: Fix broken links and provide minor edits to docs
+- PR #2723: Support and enable convert_dtype in estimator predict
+- PR #2758: Match sklearn's default n_components behavior for PCA
+- PR #2770: Fix doxygen version during cmake
+- PR #2766: Update default RandomForestRegressor score function to use r2
+- PR #2775: Enabling mg gtests w/ raft mpi comms
+- PR #2783: Add pytest that will fail when GPU IDs in Dask cluster are not unique
+- PR #2784: Add SparseCumlArray container for sparse index/data arrays
+- PR #2785: Add in cuML-specific dev conda dependencies
+- PR #2778: Add README for FIL
+- PR #2799: Reenable lightgbm test with lower (1%) proba accuracy
+- PR #2800: Align cuML's spdlog version with RMM's
+- PR #2824: Make data conversions warnings be debug level
+- PR #2835: Rng prims, utils, and dependencies in RAFT
+- PR #2541: Improve Documentation Examples and Source Linking
+- PR #2837: Make the FIL node reorder loop more obvious
+- PR #2849: make num_classes significant in FLOAT_SCALAR case
+- PR #2792: Project flash (new build process) script changes
+- PR #2850: Clean up unused params in paramsPCA
+- PR #2871: Add timing function to utils
+- PR #2863: in FIL, rename leaf_value_t enums to more descriptive
+- PR #2867: improve stability of FIL benchmark measurements
+- PR #2798: Add python tests for FIL multiclass classification of lightgbm models
+- PR #2892: Update ci/local/README.md
+- PR #2910: Adding Support for CuPy 8.x
+- PR #2914: Add tests for XGBoost multi-class models in FIL
+- PR #2622: Simplify tSNE perplexity search
+- PR #2930: Pin libfaiss to <=1.6.3
+- PR #2928: Updating Estimators Derived from Base for Consistency
+- PR #2942: Adding `cuml.experimental` to the Docs
+- PR #3010: Improve gpuCI Scripts
+
+## Bug Fixes
+- PR #2973: Allow data imputation for nan values
+- PR #2982: Adjust kneighbors classifier test threshold to avoid intermittent failure
+- PR #2885: Changing test target for NVTX wrapper test
+- PR #2882: Allow import on machines without GPUs
+- PR #2875: Bug fix to enable colorful NVTX markers
+- PR #2744: Supporting larger number of classes in KNeighborsClassifier
+- PR #2769: Remove outdated doxygen options for 1.8.20
+- PR #2787: Skip lightgbm test for version 3 and above temporarily
+- PR #2805: Retain index in stratified splitting for dataframes
+- PR #2781: Use Python print to correctly redirect spdlogs when sys.stdout is changed
+- PR #2813: Fix memory access in generation of non-row-major random blobs
+- PR #2810: Update RF MNMG threshold to prevent sporadic test failure
+- PR #2808: Relax Doxygen version required in CMake to coincide with integration repo
+- PR #2818: Fix parsing of singlegpu option in build command
+- PR #2827: Force use of whole dataset when sample bootstrapping is disabled
+- PR #2829: Fixing description for labels in docs and removing row number constraint from PCA xform/inverse_xform
+- PR #2832: Updating stress tests that fail with OOM
+- PR #2831: Removing repeated capture and parameter in lambda function
+- PR #2847: Workaround for TSNE lockup, change caching preference.
+- PR #2842: KNN index preprocessors were using incorrect n_samples
+- PR #2848: Fix typo in Python docstring for UMAP
+- PR #2856: Fix LabelEncoder for filtered input
+- PR #2855: Updates for RMM being header only
+- PR #2844: Fix for OPG KNN Classifier & Regressor
+- PR #2880: Fix bugs in Auto-ARIMA when s==None
+- PR #2877: TSNE exception for n_components > 2
+- PR #2879: Update unit test for LabelEncoder on filtered input
+- PR #2932: Marking KBinsDiscretizer pytests as xfail
+- PR #2925: Fixing Owner Bug When Slicing CumlArray Objects
+- PR #2931: Fix notebook error handling in gpuCI
+- PR #2941: Fixing dask tsvd stress test failure
+- PR #2943: Remove unused shuffle_features parameter
+- PR #2940: Correcting labels meta dtype for `cuml.dask.make_classification`
+- PR #2965: Notebooks update
+- PR #2955: Fix for conftest for singlegpu build
+- PR #2968: Remove shuffle_features from RF param names
+- PR #2957: Fix ols test size for stability
+- PR #2972: Upgrade Treelite to 0.93
+- PR #2981: Prevent unguarded import of sklearn in SVC
+- PR #2984: Fix GPU test scripts gcov error
+- PR #2990: Reduce MNMG kneighbors regressor test threshold
+- PR #2997: Changing ARIMA `get/set_params` to `get/set_fit_params`
+
# cuML 0.15.0 (Date TBD)

## New Features
+- PR #2581: Added model persistence via joblib in each section of estimator_intro.ipynb
- PR #2554: Hashing Vectorizer and general vectorizer improvements
- PR #2240: Making Dask models pickleable
- PR #2267: CountVectorizer estimator
@@ -12,11 +151,23 @@
- PR #2394: Adding cosine & correlation distance for KNN
- PR #2392: PCA can accept sparse inputs, and sparse prim for computing covariance
- PR #2465: Support pandas 1.0+
+- PR #2550: Single GPU Target Encoder
- PR #2519: Precision recall curve using cupy
- PR #2500: Replace UMAP functionality dependency on nvgraph with RAFT Spectral Clustering
+- PR #2502: cuML Implementation of `sklearn.metrics.pairwise_distances`
- PR #2520: TfidfVectorizer estimator
- PR #2211: MNMG KNN Classifier & Regressor
- PR #2461: Add KNN Sparse Output Functionality
+- PR #2615: Incremental PCA
+- PR #2594: Confidence intervals for ARIMA forecasts
+- PR #2607: Add support for probability estimates in SVC
+- PR #2618: SVM class and sample weights
+- PR #2635: Decorator to generate docstrings with autodetection of parameters
+- PR #2270: Multi class MNMG RF
+- PR #2661: CUDA-11 support for single-gpu code
+- PR #2322: Sparse FIL forests with 8-byte nodes
+- PR #2675: Update conda recipes to support CUDA 11
+- PR #2645: Add experimental, sklearn-based preprocessing

## Improvements
- PR #2336: Eliminate `rmm.device_array` usage
@@ -46,6 +197,7 @@
- PR #2403: Support for input and output type consistency in logistic regression predict_proba
- PR #2473: Add metrics.roc_auc_score to API docs. Additional readability and minor docs bug fixes
- PR #2468: Add `_n_features_in_` attribute to all single GPU estimators that implement fit
+- PR #2489: Removing explicit FAISS build and adding dependency on libfaiss conda package
- PR #2480: Moving MNMG glm and solvers to cuml
- PR #2490: Moving MNMG KMeans to cuml
- PR #2483: Moving MNMG KNN to cuml
@@ -55,6 +207,7 @@
- PR #2237: Refactor RF cython code
- PR #2513: Fixing LGTM Analysis Issues
- PR #2099: Raise an error when float64 data is used with dask RF
+- PR #2522: Renaming a few arguments in KNeighbors* to be more readable
- PR #2499: Provide access to `cuml.DBSCAN` core samples
- PR #2526: Removing PCA TSQR as a solver due to scalability issues
- PR #2536: Update conda upload versions for new supported CUDA/Python
@@ -69,8 +222,25 @@
- PR #2591: Generate benchmark datsets using `cuml.datasets`
- PR #2548: Fix limitation on number of rows usable with tSNE and refactor memory allocation
- PR #2589: including cuda-11 build fixes into raft
+- PR #2599: Add Stratified train_test_split
- PR #2487: Set classes_ attribute during classifier fit
- PR #2605: Reduce memory usage in tSNE
+- PR #2611: Adding building doxygen docs to gpu ci
+- PR #2631: Enabling use of gtest conda package for build
+- PR #2623: Fixing kmeans score() API to be compatible with Scikit-learn
+- PR #2629: Add naive_bayes api docs
+- PR #2643: 'dense' and 'sparse' values of `storage_type` for FIL
+- PR #2691: Generic Base class attribute setter
+- PR #2666: Update MBSGD documentation to mention that the model is experimental
+- PR #2687: Update xgboost version to 1.2.0dev.rapidsai0.15
+- PR #2684: CUDA 11 conda development environment yml and faiss patch
+- PR #2648: Replace CNMeM with `rmm::mr::pool_memory_resource`.
+- PR #2686: Improve SVM tests
+- PR #2692: Changing LBFGS log level
+- PR #2705: Add sum operator and base operator overloader functions to cumlarray
+- PR #2701: Updating README + Adding ref to UMAP paper
+- PR #2721: Update API docs
+- PR #2730: Unpin cumlprims in conda recipes for release

## Bug Fixes
- PR #2369: Update RF code to fix set_params memory leak
@@ -94,6 +264,8 @@
- PR #2497: Changes to accomodate cuDF unsigned categorical changes
- PR #2209: Fix FIL benchmark for gpuarray-c input
- PR #2507: Import `treelite.sklearn`
- PR #2532: Updating doxygen in new MG headers
- PR #2521: Fixing invalid smem calculation in KNeighborsCLassifier
- PR #2515: Increase tolerance for LogisticRegression test
@@ -105,12 +277,41 @@
- PR #2535: Fix issue with incorrect docker image being used in local build script
- PR #2542: Fix small memory leak in TSNE
- PR #2552: Fixed the length argument of updateDevice calls in RF test
+- PR #2565: Fix cell allocation code to avoid loops in quad-tree. Prevent NaNs causing infinite descent
- PR #2563: Update scipy call for arima gradient test
- PR #2569: Fix for cuDF update
- PR #2508: Use keyword parameters in sklearn.datasets.make_* functions
+- PR #2587: Attributes for estimators relying on solvers
- PR #2586: Fix SVC decision function data type
- PR #2573: Considering managed memory as device type on checking for KMeans
- PR #2574: Fixing include path in `tsvd_mg.pyx`
+- PR #2506: Fix usage of CumlArray attributes on `cuml.common.base.Base`
+- PR #2593: Fix inconsistency in train_test_split
+- PR #2609: Fix small doxygen issues
+- PR #2610: Remove cuDF tolist call
+- PR #2613: Removing thresholds from kmeans score tests (SG+MG)
+- PR #2616: Small test code fix for pandas dtype tests
+- PR #2617: Fix floating point precision error in tSNE
+- PR #2625: Update Estimator notebook to resolve errors
+- PR #2634: singlegpu build option fixes
+- PR #2641: [Breaking] Make `max_depth` in RF compatible with scikit-learn
+- PR #2650: Make max_depth behave consistently for max_depth > 14
+- PR #2651: AutoARIMA Python bug fix
+- PR #2654: Fix for vectorizer concatenations
+- PR #2655: Fix C++ RF predict function access of rows/samples array
+- PR #2649: Cleanup sphinx doc warnings for 0.15
+- PR #2668: Order conversion improvements to account for cupy behavior changes
+- PR #2669: Revert PR 2655 Revert "Fixes C++ RF predict function"
+- PR #2683: Fix incorrect "Bad CumlArray Use" error messages on test failures
+- PR #2695: Fix debug build issue due to incorrect host/device method setup
+- PR #2709: Fixing OneHotEncoder Overflow Error
+- PR #2710: Fix SVC doc statement about predict_proba
+- PR #2726: Return correct output type in QN
+- PR #2711: Fix Dask RF failure intermittently
+- PR #2718: Fix temp directory for py.test
+- PR #2719: Set KNeighborsRegressor output dtype according to training target dtype
+- PR #2720: Updates to outdated links
+- PR #2722: Getting cuML covariance test passing w/ Cupy 7.8 & CUDA 11

# cuML 0.14.0 (03 Jun 2020)
@@ -132,6 +333,7 @@
- PR #2256: Add a `make_arima` generator
- PR #2245: ElasticNet, Lasso and Coordinate Descent MNMG
- PR #2242: Pandas input support with output as NumPy arrays by default
+- PR #2551: Add cuML RF multiclass prediction using FIL from python
- PR #1728: Added notebook testing to gpuCI gpu build

## Improvements
@@ -283,6 +485,8 @@
- PR #2295: Fix convert_to_dtype copy even with same dtype
- PR #2305: Fixed race condition in DBScan
- PR #2354: Fix broken links in README
+- PR #2619: Explicitly skip raft test folder for pytest 6.0.0
+- PR #2788: Set the minimum number of columns that can be sampled to 1 to fix 0 mem allocation error

# cuML 0.13.0 (31 Mar 2020)
diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md
index 3ced0b646a..afb28bc23e 100644
--- a/CONTRIBUTING.md
+++ b/CONTRIBUTING.md
@@ -24,7 +24,7 @@ into three categories:

### Your first issue

-1. Read the project's [README.md](https://github.com/rapidsai/cuml/blob/master/README.md)
+1. Read the project's [README.md](https://github.com/rapidsai/cuml/blob/main/README.md)
to learn how to setup the development environment.
2. Find an issue to work on. The best way is to look for the [good first issue](https://github.com/rapidsai/cuml/issues?q=is%3Aissue+is%3Aopen+label%3A%22good+first+issue%22) or [help wanted](https://github.com/rapidsai/cuml/issues?q=is%3Aissue+is%3Aopen+label%3A%22help+wanted%22) labels
@@ -62,12 +62,12 @@ implementation of the issue, ask them in the issue instead of the PR.

The cuML repository has two main branches:

-1.
`master` branch: it contains the last released version. Only hotfixes are targeted and merged into it. +1. `main` branch: it contains the last released version. Only hotfixes are targeted and merged into it. 2. `branch-x.y`: it is the development branch which contains the upcoming release. All the new features should be based on this branch and Merge/Pull request should target this branch (with the exception of hotfixes). ### Additional details -For every new version `x.y` of cuML there is a corresponding branch called `branch-x.y`, from where new feature development starts and PRs will be targeted and merged before its release. The exceptions to this are the 'hotfixes' that target the `master` branch, which target critical issues raised by Github users and are directly merged to `master` branch, and create a new subversion of the project. While trying to patch an issue which requires a 'hotfix', please state the intent in the PR. +For every new version `x.y` of cuML there is a corresponding branch called `branch-x.y`, from where new feature development starts and PRs will be targeted and merged before its release. The exceptions to this are the 'hotfixes' that target the `main` branch, which target critical issues raised by Github users and are directly merged to `main` branch, and create a new subversion of the project. While trying to patch an issue which requires a 'hotfix', please state the intent in the PR. For all development, your changes should be pushed into a branch (created using the naming instructions below) in your own fork of cuML and then create a pull request when the code is ready. diff --git a/Dockerfile b/Dockerfile index 074a7058e3..a9d92e37b3 100644 --- a/Dockerfile +++ b/Dockerfile @@ -1,4 +1,4 @@ -# From: https://github.com/rapidsai/cudf/blob/master/Dockerfile +# From: https://github.com/rapidsai/cudf/blob/main/Dockerfile FROM cudf ENV CONDA_ENV=cudf diff --git a/README.md b/README.md index 52e96148d0..a2ec4ba307 100644 --- a/README.md +++ b/README.md @@ -14,7 +14,7 @@ For large datasets, these GPU-based implementations can complete 10-50x faster than their CPU equivalents. For details on performance, see the [cuML Benchmarks Notebook](https://github.com/rapidsai/cuml/tree/branch-0.14/notebooks/tools). -As an example, the following Python snippet loads input and computes DBSCAN clusters, all on GPU: +As an example, the following Python snippet loads input and computes DBSCAN clusters, all on GPU, using cuDF: ```python import cudf from cuml.cluster import DBSCAN @@ -43,10 +43,25 @@ dtype: int32 cuML also features multi-GPU and multi-node-multi-GPU operation, using [Dask](https://www.dask.org), for a growing list of algorithms. The following Python snippet reads input from a CSV file and performs a NearestNeighbors query across a cluster of Dask workers, using multiple GPUs on a single node: + + +Initialize a `LocalCUDACluster` configured with [UCX](https://github.com/rapidsai/ucx-py) for fast transport of CUDA arrays ```python -# Create a Dask CUDA cluster w/ one worker per device +# Initialize UCX for high-speed transport of CUDA arrays from dask_cuda import LocalCUDACluster -cluster = LocalCUDACluster() + +# Create a Dask single-node CUDA cluster w/ one worker per device +cluster = LocalCUDACluster(protocol="ucx", + enable_tcp_over_ucx=True, + enable_nvlink=True, + enable_infiniband=False) +``` + +Load data and perform `k-Nearest Neighbors` search. 
`cuml.dask` estimators also support `Dask.Array` as input: +```python + +from dask.distributed import Client +client = Client(cluster) # Read CSV file in parallel across workers import dask_cudf @@ -54,16 +69,15 @@ df = dask_cudf.read_csv("/path/to/csv") # Fit a NearestNeighbors model and query it from cuml.dask.neighbors import NearestNeighbors -nn = NearestNeighbors(n_neighbors = 10) +nn = NearestNeighbors(n_neighbors = 10, client=client) nn.fit(df) neighbors = nn.kneighbors(df) ``` - For additional examples, browse our complete [API documentation](https://docs.rapids.ai/api/cuml/stable/), or check out our example [walkthrough -notebooks](https://github.com/rapidsai/cuml/tree/branch-0.14/notebooks). Finally, you +notebooks](https://github.com/rapidsai/cuml/tree/branch-0.15/notebooks). Finally, you can find complete end-to-end examples in the [notebooks-contrib repo](https://github.com/rapidsai/notebooks-contrib). @@ -74,6 +88,7 @@ repo](https://github.com/rapidsai/notebooks-contrib). | **Clustering** | Density-Based Spatial Clustering of Applications with Noise (DBSCAN) | | | | K-Means | Multi-node multi-GPU via Dask | | **Dimensionality Reduction** | Principal Components Analysis (PCA) | Multi-node multi-GPU via Dask| +| | Incremental PCA | Experimental | | | Truncated Singular Value Decomposition (tSVD) | Multi-node multi-GPU via Dask | | | Uniform Manifold Approximation and Projection (UMAP) | Multi-node multi-GPU Inference via Dask | | | Random Projection | | @@ -82,17 +97,18 @@ repo](https://github.com/rapidsai/notebooks-contrib). | | Linear Regression with Lasso or Ridge Regularization | Multi-node multi-GPU via Dask | | | ElasticNet Regression | | | | Logistic Regression | | +| | Naive Bayes | Multi-node multi-GPU via Dask | | | Stochastic Gradient Descent (SGD), Coordinate Descent (CD), and Quasi-Newton (QN) (including L-BFGS and OWL-QN) solvers for linear models | | | **Nonlinear Models for Regression or Classification** | Random Forest (RF) Classification | Experimental multi-node multi-GPU via Dask | | | Random Forest (RF) Regression | Experimental multi-node multi-GPU via Dask | | | Inference for decision tree-based models | Forest Inference Library (FIL) | -| | K-Nearest Neighbors (KNN) | Multi-node multi-GPU via Dask, uses [Faiss](https://github.com/facebookresearch/faiss) for Nearest Neighbors Query. | -| | K-Nearest Neighbors (KNN) Classification | | -| | K-Nearest Neighbors (KNN) Regression | | +| | K-Nearest Neighbors (KNN) Classification | Multi-node multi-GPU via Dask+[UCX](https://github.com/rapidsai/ucx-py), uses [Faiss](https://github.com/facebookresearch/faiss) for Nearest Neighbors Query. | +| | K-Nearest Neighbors (KNN) Regression | Multi-node multi-GPU via Dask+[UCX](https://github.com/rapidsai/ucx-py), uses [Faiss](https://github.com/facebookresearch/faiss) for Nearest Neighbors Query. | | | Support Vector Machine Classifier (SVC) | | | | Epsilon-Support Vector Regression (SVR) | | | **Time Series** | Holt-Winters Exponential Smoothing | | | | Auto-regressive Integrated Moving Average (ARIMA) | Supports seasonality (SARIMA) | +| **Other** | K-Nearest Neighbors (KNN) Search | Multi-node multi-GPU via Dask+[UCX](https://github.com/rapidsai/ucx-py), uses [Faiss](https://github.com/facebookresearch/faiss) for Nearest Neighbors Query. | --- ## Installation @@ -115,12 +131,14 @@ For additional details on the technologies behind cuML, as well as a broader ove Please consider citing this when using cuML in a project. 
You can use the citation BibTeX:

-> @article{raschka2020machine,
->   title={Machine Learning in Python: Main developments and technology trends in data science, machine learning, and artificial intelligence},
->   author={Raschka, Sebastian and Patterson, Joshua and Nolet, Corey},
->   journal={arXiv preprint arXiv:2002.04803},
->   year={2020}
-> }
+```
+@article{raschka2020machine,
+  title={Machine Learning in Python: Main developments and technology trends in data science, machine learning, and artificial intelligence},
+  author={Raschka, Sebastian and Patterson, Joshua and Nolet, Corey},
+  journal={arXiv preprint arXiv:2002.04803},
+  year={2020}
+}
+```

## Contact
diff --git a/build.sh b/build.sh
index 74cd7e944d..1366b0cf9c 100755
--- a/build.sh
+++ b/build.sh
@@ -19,7 +19,7 @@ ARGS=$*
REPODIR=$(cd $(dirname $0); pwd)

VALIDTARGETS="clean libcuml cuml cpp-mgtests prims bench prims-bench cppdocs pydocs"
-VALIDFLAGS="-v -g -n --allgpuarch --singlegpu --nvtx --show_depr_warn -h --help "
+VALIDFLAGS="-v -g -n --allgpuarch --buildfaiss --buildgtest --singlegpu --nvtx --show_depr_warn -h --help "
VALIDARGS="${VALIDTARGETS} ${VALIDFLAGS}"
HELP="$0 [<target> ...] [<flag> ...] where <target> is:
@@ -27,7 +27,7 @@ HELP="$0 [<target> ...] [<flag> ...] where <target> is:
   libcuml - build the cuml C++ code only. Also builds the C-wrapper library around the C++ code.
   cuml - build the cuml Python package
-  cpp-mgtests - Build libcuml mnmg tests. Builds MPI communicator, adding MPI as dependency.
+  cpp-mgtests - build libcuml mnmg tests. Builds MPI communicator, adding MPI as dependency.
   prims - build the ML prims tests
   bench - build the cuml C++ benchmark
   prims-bench - build the ml-prims C++ benchmark
@@ -38,6 +38,8 @@ HELP="$0 [<target> ...] [<flag> ...] where <target> is:
   -g - build for debug
   -n - no install step
   --allgpuarch - build for all supported GPU architectures
+  --buildfaiss - build faiss statically into libcuml
+  --buildgtest - build googletest library
   --singlegpu - Build libcuml and cuml without multigpu components
   --nvtx - Enable nvtx for profiling support
   --show_depr_warn - show cmake deprecation warnings
@@ -45,7 +47,7 @@ HELP="$0 [<target> ...] [<flag> ...] where <target> is:
   default action (no args) is to build and install 'libcuml', 'cuml', and 'prims' targets only for the detected GPU arch
"
-LIBCUML_BUILD_DIR=${REPODIR}/cpp/build
+LIBCUML_BUILD_DIR=${LIBCUML_BUILD_DIR:=${REPODIR}/cpp/build}
CUML_BUILD_DIR=${REPODIR}/python/build
PYTHON_DEPS_CLONE=${REPODIR}/python/external_repositories
BUILD_DIRS="${LIBCUML_BUILD_DIR} ${CUML_BUILD_DIR} ${PYTHON_DEPS_CLONE}"
@@ -62,13 +64,13 @@
CLEAN=0
BUILD_DISABLE_DEPRECATION_WARNING=ON
BUILD_CUML_STD_COMMS=ON
BUILD_CPP_MG_TESTS=OFF
+BUILD_STATIC_FAISS=OFF

# Set defaults for vars that may not have been defined externally
#  FIXME: if INSTALL_PREFIX is not set, check PREFIX, then check
#         CONDA_PREFIX, but there is no fallback from there!
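+
+# Example (illustrative sketch, using only the targets and flags documented in
+# the help text above): a from-source build that statically links faiss, builds
+# googletest, and targets all supported GPU architectures would be invoked as:
+#   ./build.sh clean libcuml cuml prims --buildfaiss --buildgtest --allgpuarch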
INSTALL_PREFIX=${INSTALL_PREFIX:=${PREFIX:=${CONDA_PREFIX}}} PARALLEL_LEVEL=${PARALLEL_LEVEL:=""} -BUILD_ABI=${BUILD_ABI:=ON} function hasArg { (( ${NUMARGS} != 0 )) && (echo " ${ARGS} " | grep -q " $1 ") @@ -119,6 +121,12 @@ fi if hasArg cpp-mgtests; then BUILD_CPP_MG_TESTS=ON fi +if hasArg --buildfaiss; then + BUILD_STATIC_FAISS=ON +fi +if hasArg --buildgtest; then + BUILD_GTEST=ON +fi if hasArg --nvtx; then NVTX=ON fi @@ -162,7 +170,6 @@ if completeBuild || hasArg libcuml || hasArg prims || hasArg bench || hasArg pri cd ${LIBCUML_BUILD_DIR} cmake -DCMAKE_INSTALL_PREFIX=${INSTALL_PREFIX} \ - -DCMAKE_CXX11_ABI=${BUILD_ABI} \ -DBLAS_LIBRARIES=${INSTALL_PREFIX}/lib/libopenblas.so.0 \ ${GPU_ARCH} \ -DCMAKE_BUILD_TYPE=${BUILD_TYPE} \ @@ -171,6 +178,7 @@ if completeBuild || hasArg libcuml || hasArg prims || hasArg bench || hasArg pri -DWITH_UCX=ON \ -DBUILD_CUML_MPI_COMMS=${BUILD_CPP_MG_TESTS} \ -DBUILD_CUML_MG_TESTS=${BUILD_CPP_MG_TESTS} \ + -DBUILD_STATIC_FAISS=${BUILD_STATIC_FAISS} \ -DNVTX=${NVTX} \ -DPARALLEL_LEVEL=${PARALLEL_LEVEL} \ -DNCCL_PATH=${INSTALL_PREFIX} \ @@ -216,10 +224,9 @@ fi if completeBuild || hasArg cuml || hasArg pydocs; then cd ${REPODIR}/python if [[ ${INSTALL_TARGET} != "" ]]; then - python setup.py build_ext -j${PARALLEL_LEVEL:-1} --inplace ${SINGLEGPU_PYTHON_FLAG} - python setup.py install --single-version-externally-managed --record=record.txt ${SINGLEGPU_PYTHON_FLAG} + python setup.py build_ext -j${PARALLEL_LEVEL:-1} ${SINGLEGPU_PYTHON_FLAG} --library-dir=${LIBCUML_BUILD_DIR} install --single-version-externally-managed --record=record.txt else - python setup.py build_ext -j${PARALLEL_LEVEL:-1} --inplace --library-dir=${LIBCUML_BUILD_DIR} ${SINGLEGPU_PYTHON_FLAG} + python setup.py build_ext -j${PARALLEL_LEVEL:-1} --library-dir=${LIBCUML_BUILD_DIR} ${SINGLEGPU_PYTHON_FLAG} fi if hasArg pydocs; then diff --git a/ci/checks/black_lists.sh b/ci/checks/black_lists.sh index 8d1a63c47b..2ed13a2135 100755 --- a/ci/checks/black_lists.sh +++ b/ci/checks/black_lists.sh @@ -6,7 +6,6 @@ # PR_TARGET_BRANCH is set by the CI enviroment -# Checkout master for comparison git checkout --quiet $PR_TARGET_BRANCH # Switch back to tip of PR branch diff --git a/ci/checks/changelog.sh b/ci/checks/changelog.sh index 41cb6d6bd8..946c005f68 100755 --- a/ci/checks/changelog.sh +++ b/ci/checks/changelog.sh @@ -4,17 +4,17 @@ # cuML CHANGELOG Tester # ######################### -# Checkout master for comparison -git checkout --quiet master +# Checkout main for comparison +git checkout --force --quiet main # Switch back to tip of PR branch -git checkout --quiet current-pr-branch +git checkout --force --quiet current-pr-branch # Ignore errors during searching set +e # Get list of modified files between matster and PR branch -CHANGELOG=`git diff --name-only master...current-pr-branch | grep CHANGELOG.md` +CHANGELOG=`git diff --name-only main...current-pr-branch | grep CHANGELOG.md` # Check if CHANGELOG has PR ID PRNUM=`cat CHANGELOG.md | grep "$PR_ID"` RETVAL=0 diff --git a/ci/cpu/build.sh b/ci/cpu/build.sh index 6d2cddd80c..f762e5502f 100755 --- a/ci/cpu/build.sh +++ b/ci/cpu/build.sh @@ -1,27 +1,24 @@ #!/bin/bash # Copyright (c) 2018, NVIDIA CORPORATION. 
-###################################### -# cuML CPU conda build script for CI # -###################################### +############################################## +# cuML CPU conda build script for CI # +############################################## set -ex -# Logger function for build status output -function logger() { - echo -e "\n>>>> $@\n" -} - # Set path and build parallel level -export PATH=/conda/bin:/usr/local/cuda/bin:$PATH -export PARALLEL_LEVEL=4 - -# Set versions of packages needed to be grabbed -export CUDF_VERSION=0.8.* -export NVSTRINGS_VERSION=0.8.* -export RMM_VERSION=0.8.* +export PATH=/opt/conda/bin:/usr/local/cuda/bin:$PATH +export PARALLEL_LEVEL=${PARALLEL_LEVEL:-4} # Set home to the job's workspace export HOME=$WORKSPACE +# Determine CUDA release version +export CUDA_REL=${CUDA_VERSION%.*} + + # Setup 'gpuci_conda_retry' for build retries (results in 2 total attempts) +export GPUCI_CONDA_RETRY_MAX=1 +export GPUCI_CONDA_RETRY_SLEEP=30 + # Switch to project root; also root of repo checkout cd $WORKSPACE @@ -34,17 +31,22 @@ fi # SETUP - Check environment ################################################################################ -logger "Get env..." +gpuci_logger "Check environment variables" env -logger "Activate conda env..." -source activate gdf +gpuci_logger "Activate conda env" +. /opt/conda/etc/profile.d/conda.sh +conda activate rapids -logger "Check versions..." +gpuci_logger "Check compiler versions" python --version -gcc --version -g++ --version -conda list +$CC --version +$CXX --version + +gpuci_logger "Check conda environment" +conda info +conda config --show-sources +conda list --show-channel-urls # FIX Added to deal with Anancoda SSL verification issues during conda builds conda config --set ssl_verify False @@ -53,18 +55,32 @@ conda config --set ssl_verify False # BUILD - Conda package builds (conda deps: libcuml <- cuml) ################################################################################ -logger "Build conda pkg for libcuml..." -source ci/cpu/libcuml/build_libcuml.sh +if [[ -z "$PROJECT_FLASH" || "$PROJECT_FLASH" == "0" ]]; then + if [ "$BUILD_LIBCUML" == '1' -o "$BUILD_CUML" == '1' ]; then + gpuci_logger "Build conda pkg for libcuml" + gpuci_conda_retry build conda/recipes/libcuml + fi +else + if [ "$BUILD_LIBCUML" == '1' ]; then + gpuci_logger "PROJECT FLASH: Build conda pkg for libcuml" + gpuci_conda_retry build conda/recipes/libcuml --dirty --no-remove-work-dir + fi +fi -logger "Build conda pkg for cuml..." -source ci/cpu/cuml/build_cuml.sh +if [ "$BUILD_CUML" == '1' ]; then + if [[ -z "$PROJECT_FLASH" || "$PROJECT_FLASH" == "0" ]]; then + gpuci_logger "Build conda pkg for cuml" + gpuci_conda_retry build conda/recipes/cuml --python=${PYTHON} + else + gpuci_logger "PROJECT FLASH: Build conda pkg for cuml" + gpuci_conda_retry build -c ci/artifacts/cuml/cpu/conda-bld/ --dirty --no-remove-work-dir conda/recipes/cuml --python=${PYTHON} + fi +fi ################################################################################ # UPLOAD - Conda packages ################################################################################ -logger "Upload conda pkgs for libcuml..." -source ci/cpu/libcuml/upload-anaconda.sh +gpuci_logger "Upload conda pkgs" +source ci/cpu/upload.sh -logger "Upload conda pkg for cuml..." 
-source ci/cpu/cuml/upload-anaconda.sh diff --git a/ci/cpu/cuml/build_cuml.sh b/ci/cpu/cuml/build_cuml.sh deleted file mode 100644 index 561b439318..0000000000 --- a/ci/cpu/cuml/build_cuml.sh +++ /dev/null @@ -1,10 +0,0 @@ -#!/usr/bin/env bash - -set -e - -if [ "$BUILD_CUML" == '1' ]; then - echo "Building cuML" - CUDA_REL=${CUDA_VERSION%.*} - conda build conda/recipes/cuml --python=${PYTHON} - -fi diff --git a/ci/cpu/cuml/upload-anaconda.sh b/ci/cpu/cuml/upload-anaconda.sh deleted file mode 100755 index 6a79b85919..0000000000 --- a/ci/cpu/cuml/upload-anaconda.sh +++ /dev/null @@ -1,33 +0,0 @@ -#!/bin/bash -# -# Adopted from https://github.com/tmcdonell/travis-scripts/blob/dfaac280ac2082cd6bcaba3217428347899f2975/update-accelerate-buildbot.sh - -set -e - -if [ "$BUILD_CUML" == "1" ]; then - CUDA_REL=${CUDA_VERSION%.*} - - export UPLOADFILE=`conda build conda/recipes/cuml -c conda-forge -c numba -c conda-forge/label/rc_ucx -c rapidsai -c nvidia -c pytorch -c defaults --python=${PYTHON} --output` - - SOURCE_BRANCH=master - - LABEL_OPTION="--label main" - echo "LABEL_OPTION=${LABEL_OPTION}" - - test -e ${UPLOADFILE} - - # Restrict uploads to master branch - if [ ${GIT_BRANCH} != ${SOURCE_BRANCH} ]; then - echo "Skipping upload" - return 0 - fi - - if [ -z "$MY_UPLOAD_KEY" ]; then - echo "No upload key" - return 0 - fi - - echo "Upload" - echo ${UPLOADFILE} - anaconda -t ${MY_UPLOAD_KEY} upload -u ${CONDA_USERNAME:-rapidsai} ${LABEL_OPTION} --skip-existing ${UPLOADFILE} -fi diff --git a/ci/cpu/libcuml/build_libcuml.sh b/ci/cpu/libcuml/build_libcuml.sh deleted file mode 100755 index c4b88af80e..0000000000 --- a/ci/cpu/libcuml/build_libcuml.sh +++ /dev/null @@ -1,10 +0,0 @@ -#!/usr/bin/env bash - -set -e - -if [ "$BUILD_LIBCUML" == '1' -o "$BUILD_CUML" == '1' ]; then - echo "Building libcuml" - CUDA_REL=${CUDA_VERSION%.*} - - conda build conda/recipes/libcuml --python=${PYTHON} -fi diff --git a/ci/cpu/libcuml/upload-anaconda.sh b/ci/cpu/libcuml/upload-anaconda.sh deleted file mode 100644 index 26634fe865..0000000000 --- a/ci/cpu/libcuml/upload-anaconda.sh +++ /dev/null @@ -1,33 +0,0 @@ -#!/bin/bash -# -# Adopted from https://github.com/tmcdonell/travis-scripts/blob/dfaac280ac2082cd6bcaba3217428347899f2975/update-accelerate-buildbot.sh - -set -e - -if [ "$BUILD_LIBCUML" == "1" ]; then - CUDA_REL=${CUDA_VERSION%.*} - - export UPLOADFILE=`conda build conda/recipes/libcuml -c conda-forge -c numba -c conda-forge/label/rc_ucx -c nvidia -c rapidsai -c pytorch -c defaults --python=${PYTHON} --output` - - SOURCE_BRANCH=master - - LABEL_OPTION="--label main" - echo "LABEL_OPTION=${LABEL_OPTION}" - - test -e ${UPLOADFILE} - - # Restrict uploads to master branch - if [ ${GIT_BRANCH} != ${SOURCE_BRANCH} ]; then - echo "Skipping upload" - return 0 - fi - - if [ -z "$MY_UPLOAD_KEY" ]; then - echo "No upload key" - return 0 - fi - - echo "Upload" - echo ${UPLOADFILE} - anaconda -t ${MY_UPLOAD_KEY} upload -u ${CONDA_USERNAME:-rapidsai} ${LABEL_OPTION} --skip-existing ${UPLOADFILE} -fi diff --git a/ci/cpu/prebuild.sh b/ci/cpu/prebuild.sh index 04096e28ed..a362d2b9ca 100644 --- a/ci/cpu/prebuild.sh +++ b/ci/cpu/prebuild.sh @@ -1,9 +1,15 @@ #!/usr/bin/env bash -export BUILD_CUML=1 +export UPLOAD_CUML=1 if [[ "$PYTHON" == "3.7" ]]; then - export BUILD_LIBCUML=1 + export UPLOAD_LIBCUML=1 else - export BUILD_LIBCUML=0 + export UPLOAD_LIBCUML=0 fi + +if [[ -z "$PROJECT_FLASH" || "$PROJECT_FLASH" == "0" ]]; then + #If project flash is not activate, always build both + export BUILD_LIBCUML=1 + export BUILD_CUML=1 +fi \ 
No newline at end of file diff --git a/ci/cpu/upload.sh b/ci/cpu/upload.sh new file mode 100644 index 0000000000..586cabae55 --- /dev/null +++ b/ci/cpu/upload.sh @@ -0,0 +1,54 @@ +#!/bin/bash +# +# Adopted from https://github.com/tmcdonell/travis-scripts/blob/dfaac280ac2082cd6bcaba3217428347899f2975/update-accelerate-buildbot.sh + +set -e + +# Setup 'gpuci_retry' for upload retries (results in 4 total attempts) +export GPUCI_RETRY_MAX=3 +export GPUCI_RETRY_SLEEP=30 + +# Set default label options if they are not defined elsewhere +export LABEL_OPTION=${LABEL_OPTION:-"--label main"} + +# Skip uploads unless BUILD_MODE == "branch" +if [[ ${BUILD_MODE} != "branch" ]]; then + echo "Skipping upload" + return 0 +fi + +# Skip uploads if there is no upload key +if [[ -z "$MY_UPLOAD_KEY" ]]; then + echo "No upload key" + return 0 +fi + +################################################################################ +# SETUP - Get conda file output locations +################################################################################ + +gpuci_logger "Get conda file output locations" + +export LIBCUML_FILE=`conda build conda/recipes/libcuml --output` +export CUML_FILE=`conda build conda/recipes/cuml --python=$PYTHON --output` + +################################################################################ +# UPLOAD - Conda packages +################################################################################ + +gpuci_logger "Starting conda uploads" + +if [[ "$BUILD_LIBCUML" == "1" && "$UPLOAD_LIBCUML" == "1" ]]; then + test -e ${LIBCUML_FILE} + echo "Upload libcuml" + echo ${LIBCUML_FILE} + gpuci_retry anaconda -t ${MY_UPLOAD_KEY} upload -u ${CONDA_USERNAME:-rapidsai} ${LABEL_OPTION} --skip-existing ${LIBCUML_FILE} +fi + +if [[ "$BUILD_CUML" == "1" && "$UPLOAD_CUML" == "1" ]]; then + test -e ${CUML_FILE} + echo "Upload cuml" + echo ${CUML_FILE} + gpuci_retry anaconda -t ${MY_UPLOAD_KEY} upload -u ${CONDA_USERNAME:-rapidsai} ${LABEL_OPTION} --skip-existing ${CUML_FILE} +fi + diff --git a/ci/docs/build.sh b/ci/docs/build.sh index 762e580b49..863f9328d4 100644 --- a/ci/docs/build.sh +++ b/ci/docs/build.sh @@ -18,32 +18,39 @@ export LIBCUDF_KERNEL_CACHE_PATH="$HOME/.jitify-cache" export NIGHTLY_VERSION=$(echo $BRANCH_VERSION | awk -F. '{print $2}') export PROJECTS=(cuml libcuml) -logger "Check environment..." +gpuci_logger "Check environment" env -logger "Check GPU usage..." +gpuci_logger "Check GPU usage" nvidia-smi -logger "Activate conda env..." -source activate rapids + +gpuci_logger "Activate conda env" +. /opt/conda/etc/profile.d/conda.sh +conda activate rapids + # TODO: Move installs to docs-build-env meta package conda install -c anaconda beautifulsoup4 jq pip install sphinx-markdown-tables -logger "Check versions..." +gpuci_logger "Check versions" python --version $CC --version $CXX --version -conda list + +gpuci_logger "Show conda info" +conda info +conda config --show-sources +conda list --show-channel-urls # Build Doxygen docs -logger "Build Doxygen docs..." +gpuci_logger "Build Doxygen docs" cd $PROJECT_WORKSPACE/cpp/build make doc # Build Python docs -logger "Build Sphinx docs..." +gpuci_logger "Build Sphinx docs" cd $PROJECT_WORKSPACE/docs make html @@ -54,7 +61,7 @@ for PROJECT in ${PROJECTS[@]}; do if [ ! 
-d "api/$PROJECT/$BRANCH_VERSION" ]; then mkdir -p api/$PROJECT/$BRANCH_VERSION fi - rm -rf $DOCS_WORKSPACE/api/$PROJECT/$BRANCH_VERSION/* + rm -rf $DOCS_WORKSPACE/api/$PROJECT/$BRANCH_VERSION/* done diff --git a/ci/gpu/build.sh b/ci/gpu/build.sh index 46224f011c..9223621ebd 100755 --- a/ci/gpu/build.sh +++ b/ci/gpu/build.sh @@ -1,32 +1,32 @@ #!/bin/bash # Copyright (c) 2018-2020, NVIDIA CORPORATION. -######################################### -# cuML GPU build and test script for CI # -######################################### +############################################## +# cuML GPU build and test script for CI # +############################################## + set -e NUMARGS=$# ARGS=$* -# Logger function for build status output -function logger() { - echo -e "\n>>>> $@\n" -} - # Arg parsing function function hasArg { (( ${NUMARGS} != 0 )) && (echo " ${ARGS} " | grep -q " $1 ") } # Set path and build parallel level -export PATH=/conda/bin:/usr/local/cuda/bin:$PATH -export PARALLEL_LEVEL=4 -export CUDA_REL=${CUDA_VERSION%.*} +export PATH=/opt/conda/bin:/usr/local/cuda/bin:$PATH +export PARALLEL_LEVEL=${PARALLEL_LEVEL:-4} # Set home to the job's workspace export HOME=$WORKSPACE -# Parse git describei +# Determine CUDA release version +export CUDA_REL=${CUDA_VERSION%.*} + +# Switch to project root; also root of repo checkout cd $WORKSPACE + +# Parse git describe export GIT_DESCRIBE_TAG=`git describe --tags` export MINOR_VERSION=`echo $GIT_DESCRIBE_TAG | grep -o -E '([0-9]+\.[0-9]+)'` @@ -34,114 +34,208 @@ export MINOR_VERSION=`echo $GIT_DESCRIBE_TAG | grep -o -E '([0-9]+\.[0-9]+)'` # SETUP - Check environment ################################################################################ -logger "Check environment..." +gpuci_logger "Check environment" env -logger "Check GPU usage..." +gpuci_logger "Check GPU usage" nvidia-smi -logger "Activate conda env..." -source activate gdf -conda install -c conda-forge -c rapidsai -c rapidsai-nightly -c nvidia \ +gpuci_logger "Activate conda env" +. 
/opt/conda/etc/profile.d/conda.sh +conda activate rapids + +gpuci_logger "Install dependencies" +gpuci_conda_retry install -c conda-forge -c rapidsai -c rapidsai-nightly -c nvidia \ "cudatoolkit=${CUDA_REL}" \ "cudf=${MINOR_VERSION}" \ "rmm=${MINOR_VERSION}" \ - "libcumlprims=0.15.0a200720" \ + "libcumlprims=${MINOR_VERSION}" \ "dask-cudf=${MINOR_VERSION}" \ "dask-cuda=${MINOR_VERSION}" \ "ucx-py=${MINOR_VERSION}" \ - "xgboost==1.1.0dev.rapidsai0.15" \ - "rapids-build-env=$MINOR_VERSION.*" \ - "rapids-notebook-env=$MINOR_VERSION.*" + "xgboost=1.2.0dev.rapidsai${MINOR_VERSION}" \ + "rapids-build-env=${MINOR_VERSION}.*" \ + "rapids-notebook-env=${MINOR_VERSION}.*" \ + "rapids-doc-env=${MINOR_VERSION}.*" # https://docs.rapids.ai/maintainers/depmgmt/ -# conda remove -f rapids-build-env rapids-notebook-env -# conda install "your-pkg=1.0.0" - +# gpuci_conda_retry remove --force rapids-build-env rapids-notebook-env +# gpuci_conda_retry install -y "your-pkg=1.0.0" -# Install contextvars on Python 3.6 +gpuci_logger "Install contextvars if needed" py_ver=$(python -c "import sys; print('.'.join(map(str, sys.version_info[:2])))") if [ "$py_ver" == "3.6" ];then conda install contextvars fi -# Install the master version of dask, distributed, and dask-ml -logger "pip install git+https://github.com/dask/distributed.git --upgrade --no-deps" +gpuci_logger "Install the master version of dask and distributed" +set -x pip install "git+https://github.com/dask/distributed.git" --upgrade --no-deps -logger "pip install git+https://github.com/dask/dask.git --upgrade --no-deps" pip install "git+https://github.com/dask/dask.git" --upgrade --no-deps +set +x - -logger "Check versions..." +gpuci_logger "Check compiler versions" python --version $CC --version $CXX --version -conda list -################################################################################ -# BUILD - Build libcuml, cuML, and prims from source -################################################################################ +gpuci_logger "Check conda environment" +conda info +conda config --show-sources +conda list --show-channel-urls -logger "Adding ${CONDA_PREFIX}/lib to LD_LIBRARY_PATH" +gpuci_logger "Adding ${CONDA_PREFIX}/lib to LD_LIBRARY_PATH" export LD_LIBRARY_PATH_CACHED=$LD_LIBRARY_PATH export LD_LIBRARY_PATH=$CONDA_PREFIX/lib:$LD_LIBRARY_PATH -logger "Build libcuml, cuml, prims and bench targets..." -$WORKSPACE/build.sh clean libcuml cuml prims bench -v +if [[ -z "$PROJECT_FLASH" || "$PROJECT_FLASH" == "0" ]]; then + gpuci_logger "Building doxygen C++ docs" + $WORKSPACE/build.sh cppdocs -v -logger "Resetting LD_LIBRARY_PATH..." + ################################################################################ + # BUILD - Build libcuml, cuML, and prims from source + ################################################################################ -export LD_LIBRARY_PATH=$LD_LIBRARY_PATH_CACHED -export LD_LIBRARY_PATH_CACHED="" + gpuci_logger "Build from source" + $WORKSPACE/build.sh clean libcuml cuml prims bench -v -cd $WORKSPACE + gpuci_logger "Resetting LD_LIBRARY_PATH" -################################################################################ -# TEST - Run GoogleTest and py.tests for libcuml and cuML -################################################################################ + export LD_LIBRARY_PATH=$LD_LIBRARY_PATH_CACHED + export LD_LIBRARY_PATH_CACHED="" -if hasArg --skip-tests; then - logger "Skipping Tests..." - exit 0 -fi + cd $WORKSPACE -logger "Check GPU usage..." 
-nvidia-smi + ################################################################################ + # TEST - Run GoogleTest and py.tests for libcuml and cuML + ################################################################################ + set +e -Eo pipefail + EXITCODE=0 + trap "EXITCODE=1" ERR + + if hasArg --skip-tests; then + gpuci_logger "Skipping Tests" + exit 0 + fi -logger "GoogleTest for libcuml..." -cd $WORKSPACE/cpp/build -GTEST_OUTPUT="xml:${WORKSPACE}/test-results/libcuml_cpp/" ./test/ml + gpuci_logger "Check GPU usage" + nvidia-smi -logger "Python pytest for cuml..." -cd $WORKSPACE/python + gpuci_logger "GoogleTest for libcuml" + set -x + cd $WORKSPACE/cpp/build + GTEST_OUTPUT="xml:${WORKSPACE}/test-results/libcuml_cpp/" ./test/ml -pytest --cache-clear --junitxml=${WORKSPACE}/junit-cuml.xml -v -s -m "not memleak" --durations=50 --timeout=300 --ignore=cuml/test/dask + + gpuci_logger "Python pytest for cuml" + cd $WORKSPACE/python -timeout 7200 sh -c "pytest cuml/test/dask --cache-clear --junitxml=${WORKSPACE}/junit-cuml-mg.xml -v -s -m 'not memleak' --durations=50 --timeout=300" + pytest --cache-clear --basetemp=${WORKSPACE}/cuml-cuda-tmp --junitxml=${WORKSPACE}/junit-cuml.xml -v -s -m "not memleak" --durations=50 --timeout=300 --ignore=cuml/test/dask --ignore=cuml/raft --cov-config=.coveragerc --cov=cuml --cov-report=xml:${WORKSPACE}/python/cuml/cuml-coverage.xml --cov-report term + timeout 7200 sh -c "pytest cuml/test/dask --cache-clear --basetemp=${WORKSPACE}/cuml-mg-cuda-tmp --junitxml=${WORKSPACE}/junit-cuml-mg.xml -v -s -m 'not memleak' --durations=50 --timeout=300" -################################################################################ -# TEST - Run notebook tests -################################################################################ -${WORKSPACE}/ci/gpu/test-notebooks.sh 2>&1 | tee nbtest.log -python ${WORKSPACE}/ci/utils/nbtestlog2junitxml.py nbtest.log + ################################################################################ + # TEST - Run notebook tests + ################################################################################ -################################################################################ -# TEST - Run GoogleTest for ml-prims -################################################################################ - -logger "Run ml-prims test..." -cd $WORKSPACE/cpp/build -GTEST_OUTPUT="xml:${WORKSPACE}/test-results/prims/" ./test/prims + gpuci_logger "Notebook tests" + ${WORKSPACE}/ci/gpu/test-notebooks.sh 2>&1 | tee nbtest.log + python ${WORKSPACE}/ci/utils/nbtestlog2junitxml.py nbtest.log -################################################################################ -# TEST - Run GoogleTest for ml-prims, but with cuda-memcheck enabled -################################################################################ + ################################################################################ + # TEST - Run GoogleTest for ml-prims + ################################################################################ -if [ "$BUILD_MODE" = "branch" ] && [ "$BUILD_TYPE" = "gpu" ]; then - logger "GoogleTest for ml-prims with cuda-memcheck enabled..." 
+ gpuci_logger "Run ml-prims test" cd $WORKSPACE/cpp/build - python ../scripts/cuda-memcheck.py -tool memcheck -exe ./test/prims + GTEST_OUTPUT="xml:${WORKSPACE}/test-results/prims/" ./test/prims + + ################################################################################ + # TEST - Run GoogleTest for ml-prims, but with cuda-memcheck enabled + ################################################################################ + + if [ "$BUILD_MODE" = "branch" ] && [ "$BUILD_TYPE" = "gpu" ]; then + gpuci_logger "GoogleTest for ml-prims with cuda-memcheck enabled..." + cd $WORKSPACE/cpp/build + python ../scripts/cuda-memcheck.py -tool memcheck -exe ./test/prims + fi +else + #Project Flash + export LIBCUML_BUILD_DIR="$WORKSPACE/ci/artifacts/cuml/cpu/conda_work/cpp/build" + export LD_LIBRARY_PATH="$LIBCUML_BUILD_DIR:$LD_LIBRARY_PATH" + + if hasArg --skip-tests; then + gpuci_logger "Skipping Tests" + exit 0 + fi + + gpuci_logger "Check GPU usage" + nvidia-smi + + gpuci_logger "Update binaries" + cd $LIBCUML_BUILD_DIR + chrpath -d libcuml.so + chrpath -d libcuml++.so + patchelf --replace-needed `patchelf --print-needed libcuml++.so | grep faiss` libfaiss.so libcuml++.so + + gpuci_logger "GoogleTest for libcuml" + cd $LIBCUML_BUILD_DIR + chrpath -d ./test/ml + patchelf --replace-needed `patchelf --print-needed ./test/ml | grep faiss` libfaiss.so ./test/ml + GTEST_OUTPUT="xml:${WORKSPACE}/test-results/libcuml_cpp/" ./test/ml + + gpuci_logger "Installing libcuml" + conda install -c $WORKSPACE/ci/artifacts/cuml/cpu/conda-bld/ libcuml + + gpuci_logger "Building cuml" + "$WORKSPACE/build.sh" -v cuml + + gpuci_logger "Python pytest for cuml" + cd $WORKSPACE/python + + pytest --cache-clear --basetemp=${WORKSPACE}/cuml-cuda-tmp --junitxml=${WORKSPACE}/junit-cuml.xml -v -s -m "not memleak" --durations=50 --timeout=300 --ignore=cuml/test/dask --ignore=cuml/raft --cov-config=.coveragerc --cov=cuml --cov-report=xml:${WORKSPACE}/python/cuml/cuml-coverage.xml --cov-report term + + timeout 7200 sh -c "pytest cuml/test/dask --cache-clear --basetemp=${WORKSPACE}/cuml-mg-cuda-tmp --junitxml=${WORKSPACE}/junit-cuml-mg.xml -v -s -m 'not memleak' --durations=50 --timeout=300" + + ################################################################################ + # TEST - Run notebook tests + ################################################################################ + + gpuci_logger "Notebook tests" + set +e -Eo pipefail + EXITCODE=0 + trap "EXITCODE=1" ERR + + ${WORKSPACE}/ci/gpu/test-notebooks.sh 2>&1 | tee nbtest.log + python ${WORKSPACE}/ci/utils/nbtestlog2junitxml.py nbtest.log + + ################################################################################ + # TEST - Run GoogleTest for ml-prims + ################################################################################ + + gpuci_logger "Run ml-prims test" + cd $LIBCUML_BUILD_DIR + chrpath -d ./test/prims + patchelf --replace-needed `patchelf --print-needed ./test/prims | grep faiss` libfaiss.so ./test/prims + GTEST_OUTPUT="xml:${WORKSPACE}/test-results/prims/" ./test/prims + + ################################################################################ + # TEST - Run GoogleTest for ml-prims, but with cuda-memcheck enabled + ################################################################################ + + if [ "$BUILD_MODE" = "branch" ] && [ "$BUILD_TYPE" = "gpu" ]; then + logger "GoogleTest for ml-prims with cuda-memcheck enabled..." 
+ cd $WORKSPACE/ci/artifacts/cuml/cpu/conda_work/cpp/build + python ../scripts/cuda-memcheck.py -tool memcheck -exe ./test/prims + fi + + gpuci_logger "Building doxygen C++ docs" + #Need to run in standard directory, not our artifact dir + unset LIBCUML_BUILD_DIR + $WORKSPACE/build.sh cppdocs -v + fi + +return ${EXITCODE} diff --git a/ci/local/README.md b/ci/local/README.md index 1e1520bac9..6425d40f0c 100644 --- a/ci/local/README.md +++ b/ci/local/README.md @@ -18,19 +18,19 @@ Build and test your local repository using a base gpuCI Docker image where: -H Show this help text -r Path to repository (defaults to working directory) - -i Use Docker image (default is gpuci/rapidsai-base:cuda10.0-ubuntu16.04-gcc5-py3.6) + -i Use Docker image (default is gpuci/rapidsai:${NIGHTLY_VERSION}-cuda10.1-devel-ubuntu16.04-py3.7) -s Skip building and testing and start an interactive shell in a container of the Docker image ``` Example Usage: -`bash build.sh -r ~/rapids/cuml -i gpuci/rapidsai-base:cuda10.1-ubuntu16.04-gcc5-py3.6` +`bash build.sh -r ~/rapids/cuml -i gpuci/rapidsai:0.16-cuda10.2-devel-ubuntu16.04-py3.7` -For a full list of available gpuCI docker images, visit our [DockerHub](https://hub.docker.com/r/gpuci/rapidsai-base/tags) page. +For a full list of available gpuCI docker images, visit our [DockerHub](https://hub.docker.com/r/gpuci/rapidsai/tags) page. Style Check: ```bash $ bash ci/local/build.sh -r ~/rapids/cuml -s -$ source activate gdf #Activate gpuCI conda environment +$ source activate rapids # Activate gpuCI conda environment $ cd rapids $ flake8 python ``` @@ -42,7 +42,7 @@ There are some caveats to be aware of when using this script, especially if you ### Docker Image Build Repository -The docker image will generate build artifacts in a folder on your machine located in the `root` directory of the repository you passed to the script. For the above example, the directory is named `~/rapids/cuml/build_rapidsai-base_cuda10.1-ubuntu16.04-gcc5-py3.6/`. Feel free to remove this directory after the script is finished. +The docker image will generate build artifacts in a folder on your machine located in the `root` directory of the repository you passed to the script. For the above example, the directory is named `~/rapids/cuml/build_rapidsai_cuda10.1-ubuntu16.04-py3.7/`. Feel free to remove this directory after the script is finished. *Note*: The script *will not* override your local build repository. Your local environment stays in tact. diff --git a/ci/mg/build.sh b/ci/mg/build.sh index 9821d28896..9f742edb52 100644 --- a/ci/mg/build.sh +++ b/ci/mg/build.sh @@ -43,7 +43,7 @@ nvidia-smi logger "Activate conda env..." 
source activate gdf conda install -c conda-forge -c rapidsai -c rapidsai-nightly -c nvidia \ - "cupy>=7,<8.0.0a0" \ + "cupy>7.1.0,<9.0.0a0" \ "cudatoolkit=${CUDA_REL}" \ "cudf=${MINOR_VERSION}" \ "rmm=${MINOR_VERSION}" \ diff --git a/ci/release/update-version.sh b/ci/release/update-version.sh index eac4dd42d6..cb8fca4073 100755 --- a/ci/release/update-version.sh +++ b/ci/release/update-version.sh @@ -47,7 +47,7 @@ function sed_runner() { sed -i.bak ''"$1"'' $2 && rm -f ${2}.bak } -sed_runner 's/'"cuML VERSION .* LANGUAGES"'/'"cuML VERSION ${NEXT_FULL_TAG} LANGUAGES"'/g' cpp/CMakeLists.txt +sed_runner 's/'"CUML VERSION .* LANGUAGES"'/'"cuML VERSION ${NEXT_FULL_TAG} LANGUAGES"'/g' cpp/CMakeLists.txt # RTD update sed_runner 's/version = .*/version = '"'${NEXT_SHORT_TAG}'"'/g' docs/source/conf.py sed_runner 's/release = .*/release = '"'${NEXT_FULL_TAG}'"'/g' docs/source/conf.py @@ -59,4 +59,7 @@ for FILE in conda/environments/*.yml; do sed_runner "s/dask-cudf=${CURRENT_SHORT_TAG}/dask-cudf=${NEXT_SHORT_TAG}/g" ${FILE}; sed_runner "s/ucx-py=${CURRENT_SHORT_TAG}/ucx-py=${NEXT_SHORT_TAG}/g" ${FILE}; sed_runner "s/libcumlprims=${CURRENT_SHORT_TAG}/libcumlprims=${NEXT_SHORT_TAG}/g" ${FILE}; + sed_runner "s/rapids-build-env=${CURRENT_SHORT_TAG}/rapids-build-env=${NEXT_SHORT_TAG}/g" ${FILE}; + sed_runner "s/rapids-notebook-env=${CURRENT_SHORT_TAG}/rapids-notebook-env=${NEXT_SHORT_TAG}/g" ${FILE}; + sed_runner "s/rapids-doc-env=${CURRENT_SHORT_TAG}/rapids-doc-env=${NEXT_SHORT_TAG}/g" ${FILE}; done diff --git a/codecov.yml b/codecov.yml new file mode 100644 index 0000000000..c0a3a2fba2 --- /dev/null +++ b/codecov.yml @@ -0,0 +1,5 @@ +#Configuration File for CodeCov +coverage: + status: + project: off + patch: off diff --git a/conda/environments/cuml_dev_cuda10.0.yml b/conda/environments/cuml_dev_cuda10.0.yml deleted file mode 100644 index 16fbebdcee..0000000000 --- a/conda/environments/cuml_dev_cuda10.0.yml +++ /dev/null @@ -1,43 +0,0 @@ -name: cuml_dev -channels: -- rapidsai -- nvidia -- rapidsai-nightly -- conda-forge -dependencies: -- cudatoolkit=10.0 -- clang=8.0.1 -- clang-tools=8.0.1 -- cmake=3.14.5 -- numba>=0.46 -- cupy>=7,<8.0.0a0 -- cudf=0.15* -- rmm=0.15* -- cython>=0.29,<0.30 -- pytest>=4.6 -- pytest-timeout -- scikit-learn>=0.21 -- umap-learn>=0.3.9 -- scikit-learn>=0.21 -- dask>=2.12.0 -- distributed>=2.12.0 -- dask-cuda=0.15* -- dask-cudf=0.15* -- dask-ml -- ucx-py=0.15* -- nccl>=2.5 -- libcumlprims=0.15.0a200720 -- statsmodels -- treelite=0.92 -- doxygen -- sphinx -- sphinx_rtd_theme -- numpydoc -- nbsphinx -- recommonmark -- ipython -- pip -- pip: - - sphinx_markdown_tables - - git+https://github.com/dask/dask.git - - git+https://github.com/dask/distributed.git diff --git a/conda/environments/cuml_dev_cuda10.1.yml b/conda/environments/cuml_dev_cuda10.1.yml index 43795aa095..c456bc6b5d 100644 --- a/conda/environments/cuml_dev_cuda10.1.yml +++ b/conda/environments/cuml_dev_cuda10.1.yml @@ -6,38 +6,32 @@ channels: - conda-forge dependencies: - cudatoolkit=10.1 -- clang=8.0.1 -- clang-tools=8.0.1 -- cmake=3.14.5 -- numba>=0.46 -- cupy>=7,<8.0.0a0 -- cudf=0.15* -- rmm=0.15* -- cython>=0.29,<0.30 -- pytest>=4.6 -- pytest-timeout -- scikit-learn>=0.21 -- umap-learn>=0.3.9 -- scikit-learn>=0.21 -- dask>=2.12.0 -- distributed>=2.12.0 -- dask-cuda=0.15* -- dask-cudf=0.15* +- rapids-build-env=0.17 +- rapids-notebook-env=0.17 +- rapids-doc-env=0.17 +- cudf=0.17.* +- rmm=0.17.* +- libcumlprims=0.17.* +- dask-cudf=0.17.* +- dask-cuda=0.17.* +- ucx-py=0.17.* - dask-ml -- ucx-py=0.15* -- nccl>=2.5 -- 
libcumlprims=0.15.0a200720 -- statsmodels -- treelite=0.92 -- doxygen -- sphinx -- sphinx_rtd_theme -- numpydoc -- nbsphinx -- recommonmark -- ipython +- doxygen>=1.8.20 +- libfaiss>=1.6.3 +- faiss-proc=*=cuda +- umap-learn +- scikit-learn=0.23.1 +- treelite=0.93 - pip - pip: - sphinx_markdown_tables - git+https://github.com/dask/dask.git - git+https://github.com/dask/distributed.git + +# rapids-build-env, notebook-env and doc-env meta packages are defined in +# https://docs.rapids.ai/maintainers/depmgmt/ + +# To install different versions of packages contained in those meta packages, +# it is recommended to remove those meta packages (without removing the actual +# packages contained in the environment) first with: +# conda remove --force rapids-build-env rapids-notebook-env rapids-doc-env diff --git a/conda/environments/cuml_dev_cuda10.2.yml b/conda/environments/cuml_dev_cuda10.2.yml index db52085a1e..0078bf4221 100644 --- a/conda/environments/cuml_dev_cuda10.2.yml +++ b/conda/environments/cuml_dev_cuda10.2.yml @@ -6,38 +6,32 @@ channels: - conda-forge dependencies: - cudatoolkit=10.2 -- clang=8.0.1 -- clang-tools=8.0.1 -- cmake=3.14.5 -- numba>=0.46 -- cupy>=7,<8.0.0a0 -- cudf=0.15* -- rmm=0.15* -- cython>=0.29,<0.30 -- pytest>=4.6 -- pytest-timeout -- scikit-learn>=0.21 -- umap-learn>=0.3.9 -- scikit-learn>=0.21 -- dask>=2.12.0 -- distributed>=2.12.0 -- dask-cuda=0.15* -- dask-cudf=0.15* +- rapids-build-env=0.17 +- rapids-notebook-env=0.17 +- rapids-doc-env=0.17 +- cudf=0.17.* +- rmm=0.17.* +- libcumlprims=0.17.* +- dask-cudf=0.17.* +- dask-cuda=0.17.* +- ucx-py=0.17.* - dask-ml -- ucx-py=0.15* -- nccl>=2.5 -- libcumlprims=0.15.0a200720 -- statsmodels -- treelite=0.92 -- doxygen -- sphinx -- sphinx_rtd_theme -- numpydoc -- nbsphinx -- recommonmark -- ipython +- doxygen>=1.8.20 +- libfaiss>=1.6.3 +- faiss-proc=*=cuda +- umap-learn +- scikit-learn=0.23.1 +- treelite=0.93 - pip - pip: - sphinx_markdown_tables - git+https://github.com/dask/dask.git - git+https://github.com/dask/distributed.git + +# rapids-build-env, notebook-env and doc-env are defined in +# https://docs.rapids.ai/maintainers/depmgmt/ + +# To install different versions of packages contained in those meta packages, +# it is recommended to remove those meta packages (without removing the actual +# packages contained in the environment) first with: +# conda remove --force rapids-build-env rapids-notebook-env rapids-doc-env diff --git a/conda/environments/cuml_dev_cuda11.0.yml b/conda/environments/cuml_dev_cuda11.0.yml new file mode 100644 index 0000000000..7282bc7493 --- /dev/null +++ b/conda/environments/cuml_dev_cuda11.0.yml @@ -0,0 +1,37 @@ +name: cuml_dev +channels: +- rapidsai +- nvidia +- rapidsai-nightly +- conda-forge +dependencies: +- cudatoolkit=11.0 +- rapids-build-env=0.17 +- rapids-notebook-env=0.17 +- rapids-doc-env=0.17 +- cudf=0.17.* +- rmm=0.17.* +- libcumlprims=0.17.* +- dask-cudf=0.17.* +- dask-cuda=0.17.* +- ucx-py=0.17.* +- dask-ml +- doxygen>=1.8.20 +- libfaiss>=1.6.3 +- faiss-proc=*=cuda +- umap-learn +- scikit-learn=0.23.1 +- treelite=0.93 +- pip +- pip: + - sphinx_markdown_tables + - git+https://github.com/dask/dask.git + - git+https://github.com/dask/distributed.git + +# rapids-build-env, notebook-env and doc-env are defined in +# https://docs.rapids.ai/maintainers/depmgmt/ + +# To install different versions of packages contained in those meta packages, +# it is recommended to remove those meta packages (without removing the actual +# packages contained in the environment) first with: +# conda remove 
--force rapids-build-env rapids-notebook-env rapids-doc-env diff --git a/conda/recipes/cuml/meta.yaml b/conda/recipes/cuml/meta.yaml index 14a49504ff..740d238d1d 100644 --- a/conda/recipes/cuml/meta.yaml +++ b/conda/recipes/cuml/meta.yaml @@ -28,10 +28,10 @@ requirements: - setuptools - cython>=0.29,<0.30 - cmake>=3.14 - - treelite=0.92 + - treelite=0.93 - cudf {{ minor_version }} - libcuml={{ version }} - - libcumlprims 0.15.0a200720 + - libcumlprims {{ minor_version }} - cudatoolkit {{ cuda_version }}.* - ucx-py {{ minor_version }} run: @@ -39,9 +39,9 @@ requirements: - cudf {{ minor_version }} - dask-cudf {{ minor_version }} - libcuml={{ version }} - - libcumlprims=0.15.0a200720 - - treelite=0.92 - - cupy>=7,<8.0.0a0 + - libcumlprims {{ minor_version }} + - treelite=0.93 + - cupy>7.1.0,<9.0.0a0 - nccl>=2.5 - ucx-py {{ minor_version }} - dask>=2.12.0 diff --git a/conda/recipes/libcuml/build.sh b/conda/recipes/libcuml/build.sh index 04b629c8b9..318fa1b445 100644 --- a/conda/recipes/libcuml/build.sh +++ b/conda/recipes/libcuml/build.sh @@ -5,9 +5,8 @@ if [ -n "$MACOSX_DEPLOYMENT_TARGET" ]; then export MACOSX_DEPLOYMENT_TARGET=10.11 fi -# show environment -printenv -# Cleanup local git -git clean -xdf - -./build.sh clean libcuml -v --allgpuarch +if [[ -z "$PROJECT_FLASH" || "$PROJECT_FLASH" == "0" ]]; then + ./build.sh clean libcuml -v --allgpuarch +else + ./build.sh clean libcuml prims -v --allgpuarch +fi diff --git a/conda/recipes/libcuml/meta.yaml b/conda/recipes/libcuml/meta.yaml index e37d8363be..012f911253 100644 --- a/conda/recipes/libcuml/meta.yaml +++ b/conda/recipes/libcuml/meta.yaml @@ -23,6 +23,7 @@ build: - CUDAHOSTCXX - PARALLEL_LEVEL - VERSION_SUFFIX + - PROJECT_FLASH requirements: build: @@ -30,20 +31,25 @@ requirements: - clang=8.0.1 - clang-tools=8.0.1 host: - - nccl 2.5.* + - nccl >=2.5 - cudf {{ minor_version }} - cudatoolkit {{ cuda_version }}.* - ucx-py {{ minor_version }} - - libcumlprims=0.15.0a200720 + - libcumlprims {{ minor_version }} - lapack - - treelite=0.92 + - treelite=0.93 + - faiss-proc=*=cuda + - gtest=1.10.0 + - libfaiss=1.6.3 run: - - libcumlprims=0.15.0a200720 + - libcumlprims {{ minor_version }} - cudf {{ minor_version }} - nccl>=2.5 - ucx-py {{ minor_version }} - {{ pin_compatible('cudatoolkit', max_pin='x.x') }} - - treelite=0.92 + - treelite=0.93 + - faiss-proc=*=cuda + - libfaiss=1.6.3 about: home: http://rapids.ai/ diff --git a/cpp/CMakeLists.txt b/cpp/CMakeLists.txt index 61e90b4985..7cb4d471f1 100644 --- a/cpp/CMakeLists.txt +++ b/cpp/CMakeLists.txt @@ -16,9 +16,9 @@ set (CMAKE_FIND_NO_INSTALL_PREFIX TRUE FORCE) -cmake_minimum_required(VERSION 3.14 FATAL_ERROR) +cmake_minimum_required(VERSION 3.14...3.17 FATAL_ERROR) -project(CUML VERSION 0.15.0 LANGUAGES C CXX CUDA) +project(CUML VERSION 0.17.0 LANGUAGES C CXX CUDA) ############################################################################## # - build type --------------------------------------------------------------- @@ -64,7 +64,9 @@ option(BUILD_CUML_STD_COMMS "Build the standard NCCL+UCX Communicator" ON) option(BUILD_CUML_MPI_COMMS "Build the MPI+NCCL Communicator (used for testing)" OFF) -option(CMAKE_CXX11_ABI "Enable the GLIBCXX11 ABI" ON) +option(BUILD_STATIC_FAISS "Build the FAISS library for nearest neighbors search on GPU" OFF) + +option(BUILD_GTEST "Build the GTEST library for running libcuml++ and prims test executables" OFF) option(DETECT_CONDA_ENV "Enable detection of conda environment for dependencies" ON) @@ -72,8 +74,6 @@ option(DISABLE_OPENMP "Disable OpenMP" OFF) 
option(ENABLE_CUMLPRIMS_MG "Enable algorithms that use libcumlprims_mg" ON) -option(EMPTY_MARKER_KERNEL "Enable empty marker kernel after nvtxRangePop" OFF) - option(KERNEL_INFO "Enable kernel resource usage info" OFF) option(LINE_INFO "Enable lineinfo in nvcc" OFF) @@ -169,22 +169,34 @@ endif(BUILD_CUML_MG_TESTS AND NOT SINGLEGPU) add_subdirectory(${CMAKE_CURRENT_SOURCE_DIR}/cmake/templates) -GENERATE_FIND_MODULE(NAME cumlprims_mg - HEADER_NAME cumlprims.hpp - LIBRARY_NAME cumlprims - LOCATION cumlprims) +if(ENABLE_CUMLPRIMS_MG) + GENERATE_FIND_MODULE( + NAME cumlprims_mg + HEADER_NAME cumlprims.hpp + LIBRARY_NAME cumlprims + LOCATION cumlprims) +endif(ENABLE_CUMLPRIMS_MG) -GENERATE_FIND_MODULE(NAME NCCL - HEADER_NAME nccl.h - LIBRARY_NAME nccl) +if(BUILD_CUML_STD_COMMS OR BUILD_CUML_MPI_COMMS) +GENERATE_FIND_MODULE( + NAME NCCL + HEADER_NAME nccl.h + LIBRARY_NAME nccl) +endif(BUILD_CUML_STD_COMMS OR BUILD_CUML_MPI_COMMS) -GENERATE_FIND_MODULE(NAME RMM - HEADER_NAME rmm/device_buffer.hpp - LIBRARY_NAME rmm) +if(BUILD_CUML_STD_COMMS) + GENERATE_FIND_MODULE( + NAME UCX + HEADER_NAME ucp/api/ucp.h + LIBRARY_NAME ucp) +endif(BUILD_CUML_STD_COMMS) -GENERATE_FIND_MODULE(NAME UCX - HEADER_NAME ucp/api/ucp.h - LIBRARY_NAME ucp) +if(NOT BUILD_STATIC_FAISS) + GENERATE_FIND_MODULE( + NAME FAISS + HEADER_NAME faiss/IndexFlat.h + LIBRARY_NAME faiss) +endif(NOT BUILD_STATIC_FAISS) set(CMAKE_MODULE_PATH ${CMAKE_CURRENT_BINARY_DIR}/cmake) @@ -254,9 +266,7 @@ endif(KERNEL_INFO) if(NVTX) set(CMAKE_CUDA_FLAGS "${CMAKE_CUDA_FLAGS} -DNVTX_ENABLED") set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -DNVTX_ENABLED") - if(EMPTY_MARKER_KERNEL) - set(CMAKE_CUDA_FLAGS "${CMAKE_CUDA_FLAGS} -DENABLE_EMPTY_MARKER_KERNEL") - endif(EMPTY_MARKER_KERNEL) + set(CMAKE_CUDA_FLAGS "${CMAKE_CUDA_FLAGS}") endif(NVTX) if(CMAKE_BUILD_TYPE MATCHES Debug) @@ -269,6 +279,14 @@ if("${GPU_ARCHS}" STREQUAL "") evaluate_gpu_archs(GPU_ARCHS) endif() + +# CUDA 11 onwards cub ships with CTK +if((CUDA_VERSION_MAJOR EQUAL 11) OR (CUDA_VERSION_MAJOR GREATER 11)) + set(CUB_IS_PART_OF_CTK ON) +else() + set(CUB_IS_PART_OF_CTK OFF) +endif() + if("${GPU_ARCHS}" STREQUAL "ALL") set(GPU_ARCHS "60") if((CUDA_VERSION_MAJOR EQUAL 9) OR (CUDA_VERSION_MAJOR GREATER 9)) @@ -277,6 +295,9 @@ if("${GPU_ARCHS}" STREQUAL "ALL") if((CUDA_VERSION_MAJOR EQUAL 10) OR (CUDA_VERSION_MAJOR GREATER 10)) set(GPU_ARCHS "${GPU_ARCHS};75") endif() + if((CUDA_VERSION_MAJOR EQUAL 11) OR (CUDA_VERSION_MAJOR GREATER 11)) + set(GPU_ARCHS "${GPU_ARCHS};80") + endif() endif() message("-- Building for GPU_ARCHS = ${GPU_ARCHS}") @@ -290,17 +311,6 @@ list(GET GPU_ARCHS -1 ptx) set(CMAKE_CUDA_FLAGS "${CMAKE_CUDA_FLAGS} -gencode arch=compute_${ptx},code=compute_${ptx}") set(FAISS_GPU_ARCHS "${FAISS_GPU_ARCHS} -gencode arch=compute_${ptx},code=compute_${ptx}") -if(CMAKE_COMPILER_IS_GNUCXX) - if(NOT CMAKE_CXX11_ABI) - message(STATUS "Disabling the GLIBCXX11 ABI") - set(CMAKE_C_FLAGS "${CMAKE_C_FLAGS} -D_GLIBCXX_USE_CXX11_ABI=0") - set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -D_GLIBCXX_USE_CXX11_ABI=0") - set(CMAKE_CUDA_FLAGS "${CMAKE_CUDA_FLAGS} -Xcompiler -D_GLIBCXX_USE_CXX11_ABI=0") - elseif(CMAKE_CXX11_ABI) - message(STATUS "Enabling the GLIBCXX11 ABI") - endif(NOT CMAKE_CXX11_ABI) -endif(CMAKE_COMPILER_IS_GNUCXX) - set(CMAKE_CUDA_FLAGS "${CMAKE_CUDA_FLAGS} -Xcudafe --diag_suppress=unrecognized_gcc_pragma") @@ -317,37 +327,53 @@ set(CUML_INCLUDE_DIRECTORIES ${CMAKE_CURRENT_SOURCE_DIR}/src ${CMAKE_CURRENT_SOURCE_DIR}/src_prims ${CMAKE_CURRENT_SOURCE_DIR}/test/prims - ${FAISS_DIR}/src/ 
${CMAKE_CUDA_TOOLKIT_INCLUDE_DIRECTORIES} ${CUTLASS_DIR}/src/cutlass - ${CUB_DIR}/src/cub ${SPDLOG_DIR}/src/spdlog/include + ${FAISS_INCLUDE_DIRS} ${RAFT_DIR}/cpp/include - ${RMM_INCLUDE_DIRS} - ) + ${RMM_INCLUDE_DIRS}) + +if(NOT CUB_IS_PART_OF_CTK) + list(APPEND CUML_INCLUDE_DIRECTORIES ${CUB_DIR}/src/cub) +endif(NOT CUB_IS_PART_OF_CTK) set(CUML_PUBLIC_LINK_LIBRARIES ${CUDA_cublas_LIBRARY} ${CUDA_curand_LIBRARY} ${CUDA_cusolver_LIBRARY} ${CUDA_CUDART_LIBRARY} - ${CUDA_cusparse_LIBRARY} - ) + ${CUDA_cusparse_LIBRARY}) + set(CUML_PRIVATE_LINK_LIBRARIES - faisslib + FAISS::FAISS treelite::treelite treelite::treelite_runtime - RMM::RMM ) +if(BUILD_CUML_STD_COMMS OR BUILD_CUML_MPI_COMMS) + list(APPEND CUML_INCLUDE_DIRECTORIES + ${NCCL_INCLUDE_DIRS}) + + list(APPEND CUML_PRIVATE_LINK_LIBRARIES + NCCL::NCCL) +endif(BUILD_CUML_STD_COMMS OR BUILD_CUML_MPI_COMMS) + +if(BUILD_CUML_MPI_COMMS) + list(APPEND CUML_INCLUDE_DIRECTORIES + ${MPI_CXX_INCLUDE_PATH}) + + list(APPEND CUML_PRIVATE_LINK_LIBRARIES + ${MPI_CXX_LIBRARIES}) +endif(BUILD_CUML_MPI_COMMS) + if(ENABLE_CUMLPRIMS_MG) list(APPEND CUML_INCLUDE_DIRECTORIES ${cumlprims_mg_INCLUDE_DIRS}) list(APPEND CUML_PRIVATE_LINK_LIBRARIES cumlprims_mg::cumlprims_mg) - endif(ENABLE_CUMLPRIMS_MG) ############################################################################## @@ -358,14 +384,12 @@ if(BUILD_CUML_CPP_LIBRARY) # single GPU components add_library(${CUML_CPP_TARGET} SHARED - src/arima/batched_kalman.cu src/arima/batched_arima.cu + src/arima/batched_kalman.cu src/common/cumlHandle.cpp src/common/cuml_api.cpp - src/common/cuML_comms_impl.cpp src/common/logger.cpp src/common/nvtx.cu - src/comms/cuML_comms_test.cpp src/datasets/make_arima.cu src/datasets/make_blobs.cu src/datasets/make_regression.cu @@ -465,17 +489,6 @@ if(BUILD_CUML_TESTS OR BUILD_CUML_MG_TESTS OR BUILD_PRIMS_TESTS) add_subdirectory(test ${PROJECT_BINARY_DIR}/test) endif(BUILD_CUML_TESTS OR BUILD_CUML_MG_TESTS OR BUILD_PRIMS_TESTS) -############################################################################## -# - build comms ------------------------------------------------------------------------------ - -if(BUILD_CUML_STD_COMMS) - add_subdirectory(comms/std) -endif(BUILD_CUML_STD_COMMS) - -if(BUILD_CUML_MPI_COMMS) - add_subdirectory(comms/mpi) -endif(BUILD_CUML_MPI_COMMS) - ############################################################################## # - build examples ------------------------------------------------------------------------------ @@ -491,6 +504,7 @@ install(TARGETS ${CUML_CPP_TARGET} DESTINATION lib) install(DIRECTORY ${CUML_INCLUDE_DIR}/cuml DESTINATION include) +install(DIRECTORY ${RAFT_DIR}/cpp/include/ DESTINATION include/cuml) ############################################################################## # - build benchmark executable ----------------------------------------------- diff --git a/cpp/Doxyfile.in b/cpp/Doxyfile.in index 7a8200db78..d8cd284118 100644 --- a/cpp/Doxyfile.in +++ b/cpp/Doxyfile.in @@ -230,12 +230,6 @@ TAB_SIZE = 4 ALIASES = -# This tag can be used to specify a number of word-keyword mappings (TCL only). -# A mapping has the form "name=value". For example adding "class=itcl::class" -# will allow you to use the command class in the itcl::class meaning. - -TCL_SUBST = - # Set the OPTIMIZE_OUTPUT_FOR_C tag to YES if your project consists of C sources # only. Doxygen will then generate output that is more tailored for C. For # instance, some of the names that are used will be different. 
The list of all @@ -771,8 +765,7 @@ WARN_LOGFILE = # spaces. See also FILE_PATTERNS and EXTENSION_MAPPING # Note: If this tag is empty the current directory is searched. -INPUT = @CMAKE_CURRENT_SOURCE_DIR@/comms \ - @CMAKE_CURRENT_SOURCE_DIR@/include \ +INPUT = @CMAKE_CURRENT_SOURCE_DIR@/include \ @CMAKE_CURRENT_SOURCE_DIR@/src \ @CMAKE_CURRENT_SOURCE_DIR@/src_prims @@ -873,7 +866,11 @@ EXAMPLE_RECURSIVE = NO # that contain images that are to be included in the documentation (see the # \image command). -IMAGE_PATH = @CMAKE_CURRENT_SOURCE_DIR@/doxygen/images +# IMAGE_PATH = @CMAKE_CURRENT_SOURCE_DIR@/doxygen/images + +# temporarily using cmake_current_source_dir for image path since we don't have images, +# comment the above whenever images are needed in the doxygen/images folder +IMAGE_PATH = @CMAKE_CURRENT_SOURCE_DIR@/ # The INPUT_FILTER tag can be used to specify a program that doxygen should # invoke to filter for each input file. Doxygen will invoke the filter program @@ -1017,25 +1014,6 @@ USE_HTAGS = NO VERBATIM_HEADERS = YES -# If the CLANG_ASSISTED_PARSING tag is set to YES then doxygen will use the -# clang parser (see: http://clang.llvm.org/) for more accurate parsing at the -# cost of reduced performance. This can be particularly helpful with template -# rich C++ code for which doxygen's built-in parser lacks the necessary type -# information. -# Note: The availability of this option depends on whether or not doxygen was -# generated with the -Duse-libclang=ON option for CMake. -# The default value is: NO. - -CLANG_ASSISTED_PARSING = NO - -# If clang assisted parsing is enabled you can provide the compiler with command -# line options that you would normally use when invoking the compiler. Note that -# the include paths will already be set by doxygen for the files and directories -# specified with INPUT and INCLUDE_PATH. -# This tag requires that the tag CLANG_ASSISTED_PARSING is set to YES. - -CLANG_OPTIONS = - #--------------------------------------------------------------------------- # Configuration options related to the alphabetical class index #--------------------------------------------------------------------------- @@ -1500,10 +1478,10 @@ MATHJAX_FORMAT = HTML-CSS # Content Delivery Network so you can quickly see the result without installing # MathJax. However, it is strongly recommended to install a local copy of # MathJax from http://www.mathjax.org before deployment. -# The default value is: http://cdn.mathjax.org/mathjax/latest. +# The default value is: https://cdn.mathjax.org/mathjax/latest. # This tag requires that the tag USE_MATHJAX is set to YES. -MATHJAX_RELPATH = http://cdn.mathjax.org/mathjax/latest +MATHJAX_RELPATH = https://cdn.mathjax.org/mathjax/latest # The MATHJAX_EXTENSIONS tag can be used to specify one or more MathJax # extension names that should be enabled during MathJax rendering. For example @@ -2113,11 +2091,6 @@ EXTERNAL_GROUPS = YES EXTERNAL_PAGES = YES -# The PERL_PATH should be the absolute path and name of the perl script -# interpreter (i.e. the result of 'which perl'). -# The default file (with absolute path) is: /usr/bin/perl. - -PERL_PATH = /usr/bin/perl #--------------------------------------------------------------------------- # Configuration options related to the dot tool @@ -2132,14 +2105,6 @@ PERL_PATH = /usr/bin/perl CLASS_DIAGRAMS = YES -# You can define message sequence charts within doxygen comments using the \msc -# command. 
Doxygen will then run the mscgen tool (see:
-# http://www.mcternan.me.uk/mscgen/)) to produce the chart and insert it in the
-# documentation. The MSCGEN_PATH tag allows you to specify the directory where
-# the mscgen tool resides. If left empty the tool is assumed to be found in the
-# default search path.
-
-MSCGEN_PATH =
 # You can include diagrams made with dia in doxygen documentation. Doxygen will
 # then run dia to produce the diagram and insert it in the documentation. The
diff --git a/cpp/README.md b/cpp/README.md
index d258e084c6..f4d9076710 100644
--- a/cpp/README.md
+++ b/cpp/README.md
@@ -33,6 +33,7 @@ Current cmake offers the following configuration options:
 | Flag | Possible Values | Default Value | Behavior |
 | --- | --- | --- | --- |
 | BUILD_CUML_CPP_LIBRARY | [ON, OFF] | ON | Enable/disable building libcuml++ shared library. Setting this variable to `OFF` sets the variables BUILD_CUML_TESTS, BUILD_CUML_MG_TESTS and BUILD_CUML_EXAMPLES to `OFF` |
+| BUILD_GTEST | [ON, OFF] | ON | Enable/disable building Googletest for test executables. The library search path will be used to find an existing version. |
 | BUILD_CUML_TESTS | [ON, OFF] | ON | Enable/disable building cuML algorithm test executable `ml_test`. |
 | BUILD_CUML_MG_TESTS | [ON, OFF] | ON | Enable/disable building cuML algorithm test executable `ml_mg_test`. Requires MPI to be installed. When enabled, BUILD_CUML_MPI_COMMS will be automatically set to ON. |
 | BUILD_PRIMS_TESTS | [ON, OFF] | ON | Enable/disable building cuML algorithm test executable `prims_test`. |
@@ -42,7 +43,7 @@ Current cmake offers the following configuration options:
 | SINGLEGPU | [ON, OFF] | OFF | Disable all mnmg components. Disables building of all multi-GPU algorithms and all comms library components. Removes libcumlprims, UCX-py and NCCL dependencies. Overrides values of BUILD_CUML_MG_TESTS, BUILD_CUML_STD_COMMS, WITH_UCX and BUILD_CUML_MPI_COMMS. |
 | BUILD_CUML_EXAMPLES | [ON, OFF] | ON | Enable/disable building cuML C++ API usage examples. |
 | BUILD_CUML_BENCH | [ON, OFF] | ON | Enable/disable building of the cuML C++ benchmark. |
-| CMAKE_CXX11_ABI | [ON, OFF] | ON | Enable/disable the GLIBCXX11 ABI |
+| BUILD_STATIC_FAISS | [ON, OFF] | OFF | Enable/disable building and static linking of FAISS into cuML. When this option is disabled, the build will search for an installed version of FAISS. |
 | DISABLE_OPENMP | [ON, OFF] | OFF | Set to `ON` to disable OpenMP |
 | GPU_ARCHS | List of GPU architectures, semicolon-separated | Empty | List of GPU architectures that all artifacts are compiled for. Passing ALL means compiling for all currently supported GPU architectures: 60;70;75. If you don't pass this flag, then the build system will try to look for the GPU card installed on the system and compile only for that. |
@@ -50,7 +51,9 @@ Current cmake offers the following configuration options:
 | Flag | Possible Values | Default Value | Behavior |
 | --- | --- | --- | --- |
-| BLAS_LIBRARIES | path/to/blas_lib | "" | Optional variable allowing to manually specify location of BLAS library. |
+| BLAS_LIBRARIES | path/to/blas_lib | "" | Optional variable allowing to manually specify location of BLAS library. This is only used when BUILD_STATIC_FAISS=ON |
+| FAISS_ROOT | path/to/faiss | "" | Optional variable allowing to manually specify the location of FAISS. |
+| GTEST_ROOT | path/to/gtest | "" | Optional variable allowing to manually specify the location of Googletest. |
 | NCCL_PATH| path/to/nccl | "" | Optional variable allowing to manually specify location of NCCL library. |
 | CUMLPRIMS_MG_PATH | path/to/libcumlprims | "" | Optional variable allowing to manually specify location of libcumlprims library. |
@@ -83,3 +86,36 @@ Current external submodules are:
 2. [CUB](https://github.com/NVlabs/cub)
 3. [Faiss](https://github.com/facebookresearch/faiss)
 4. [Google Test](https://github.com/google/googletest)
+
+## Using cuML libraries
+
+After building cuML, you can use its functionality in other C/C++ applications
+by linking against the generated libraries. The following trivial example shows
+how to make external use of cuML's logger:
+
+```cpp
+// main.cpp
+#include <cuml/common/logger.hpp>
+
+int main(int argc, char *argv[]) {
+  CUML_LOG_WARN("This is a warning from the cuML logger!");
+  return 0;
+}
+```
+
+To compile this example, we must point the compiler to where cuML was
+installed. Assuming you did not provide a custom `$CMAKE_INSTALL_PREFIX`, this
+will default to the `$CONDA_PREFIX` environment variable.
+
+```bash
+$ export LD_LIBRARY_PATH="${CONDA_PREFIX}/lib"
+$ nvcc \
+    main.cpp \
+    -o cuml_logger_example \
+    "-L${CONDA_PREFIX}/lib" \
+    "-I${CONDA_PREFIX}/include" \
+    "-I${CONDA_PREFIX}/include/cuml/raft" \
+    -lcuml++
+$ ./cuml_logger_example
+[W] [13:26:43.503068] This is a warning from the cuML logger!
+```
diff --git a/cpp/bench/CMakeLists.txt b/cpp/bench/CMakeLists.txt
index 63cecba1ac..48de6fe1bd 100644
--- a/cpp/bench/CMakeLists.txt
+++ b/cpp/bench/CMakeLists.txt
@@ -27,10 +27,13 @@ if(BUILD_CUML_BENCH)
     sg/kmeans.cu
     sg/main.cpp
     sg/rf_classifier.cu
-    sg/rf_regressor.cu
+    # FIXME: RF Regressor is having an issue where the tests now seem to take
+    # forever to finish, as opposed to the classifier counterparts!
+    # sg/rf_regressor.cu
     sg/svc.cu
     sg/svr.cu
     sg/umap.cu
+    sg/fil.cu
   )
 target_link_libraries(sg_benchmark
@@ -62,9 +65,11 @@ if(BUILD_CUML_PRIMS_BENCH)
     prims/permute.cu
     prims/reduce.cu
     prims/rng.cu
-    ../src/common/logger.cpp # because prims is header only!
-  )
+    ../src/common/logger.cpp) # because prims is header only!
+  if(NOT CUB_IS_PART_OF_CTK)
+    add_dependencies(prims_benchmark cub)
+  endif(NOT CUB_IS_PART_OF_CTK)
 add_dependencies(prims_benchmark spdlog)
 target_link_libraries(prims_benchmark ${CUDA_cublas_LIBRARY} benchmarklib)
diff --git a/cpp/bench/common/ml_benchmark.hpp b/cpp/bench/common/ml_benchmark.hpp
index 678c32da52..a205256cda 100644
--- a/cpp/bench/common/ml_benchmark.hpp
+++ b/cpp/bench/common/ml_benchmark.hpp
@@ -17,8 +17,8 @@
 #pragma once
 #include
-#include
 #include
+#include
 #include
 #include
 #include
diff --git a/cpp/bench/prims/add.cu b/cpp/bench/prims/add.cu
index 92eadcc289..c89c7413d4 100644
--- a/cpp/bench/prims/add.cu
+++ b/cpp/bench/prims/add.cu
@@ -14,7 +14,7 @@
  * limitations under the License.
*/ -#include +#include #include "../common/ml_benchmark.hpp" namespace MLCommon { @@ -28,8 +28,8 @@ struct AddParams { template struct AddBench : public Fixture { AddBench(const std::string& name, const AddParams& p) - : Fixture(name, - std::shared_ptr(new defaultDeviceAllocator)), + : Fixture(name, std::shared_ptr( + new raft::mr::device::default_allocator)), params(p) {} protected: @@ -45,7 +45,7 @@ struct AddBench : public Fixture { void runBenchmark(::benchmark::State& state) override { loopOnState(state, [this]() { - MLCommon::LinAlg::add(ptr0, ptr0, ptr1, params.len, stream); + raft::linalg::add(ptr0, ptr0, ptr1, params.len, stream); }); } diff --git a/cpp/bench/prims/distance_common.cuh b/cpp/bench/prims/distance_common.cuh index 895a6be86e..112d17d18f 100644 --- a/cpp/bench/prims/distance_common.cuh +++ b/cpp/bench/prims/distance_common.cuh @@ -14,7 +14,7 @@ * limitations under the License. */ -#include +#include #include #include "../common/ml_benchmark.hpp" @@ -26,11 +26,11 @@ struct Params { int m, n, k; }; // struct Params -template +template struct Distance : public Fixture { Distance(const std::string& name, const Params& p) - : Fixture(name, - std::shared_ptr(new defaultDeviceAllocator)), + : Fixture(name, std::shared_ptr( + new raft::mr::device::default_allocator)), params(p) {} protected: diff --git a/cpp/bench/prims/distance_cosine.cu b/cpp/bench/prims/distance_cosine.cu index 6f9291535e..5f937fdd43 100644 --- a/cpp/bench/prims/distance_cosine.cu +++ b/cpp/bench/prims/distance_cosine.cu @@ -20,7 +20,8 @@ namespace MLCommon { namespace Bench { namespace Distance { -DIST_BENCH_REGISTER(DistanceCosine, MLCommon::Distance::EucExpandedCosine); +DIST_BENCH_REGISTER(DistanceCosine, + ML::Distance::DistanceType::EucExpandedCosine); } // namespace Distance } // namespace Bench diff --git a/cpp/bench/prims/distance_exp_l2.cu b/cpp/bench/prims/distance_exp_l2.cu index 31ca18f5f9..9940e6ba28 100644 --- a/cpp/bench/prims/distance_exp_l2.cu +++ b/cpp/bench/prims/distance_exp_l2.cu @@ -20,8 +20,9 @@ namespace MLCommon { namespace Bench { namespace Distance { -DIST_BENCH_REGISTER(DistanceL2Sq, MLCommon::Distance::EucExpandedL2); -DIST_BENCH_REGISTER(DistanceL2Sqrt, MLCommon::Distance::EucExpandedL2Sqrt); +DIST_BENCH_REGISTER(DistanceL2Sq, ML::Distance::DistanceType::EucExpandedL2); +DIST_BENCH_REGISTER(DistanceL2Sqrt, + ML::Distance::DistanceType::EucExpandedL2Sqrt); } // namespace Distance } // namespace Bench diff --git a/cpp/bench/prims/distance_l1.cu b/cpp/bench/prims/distance_l1.cu index 6abb0cb8aa..1e97e9b891 100644 --- a/cpp/bench/prims/distance_l1.cu +++ b/cpp/bench/prims/distance_l1.cu @@ -20,7 +20,7 @@ namespace MLCommon { namespace Bench { namespace Distance { -DIST_BENCH_REGISTER(DistanceL1, MLCommon::Distance::EucUnexpandedL1); +DIST_BENCH_REGISTER(DistanceL1, ML::Distance::DistanceType::EucUnexpandedL1); } // namespace Distance } // namespace Bench diff --git a/cpp/bench/prims/distance_unexp_l2.cu b/cpp/bench/prims/distance_unexp_l2.cu index 5bbf3d81f3..82e65c69ea 100644 --- a/cpp/bench/prims/distance_unexp_l2.cu +++ b/cpp/bench/prims/distance_unexp_l2.cu @@ -20,9 +20,10 @@ namespace MLCommon { namespace Bench { namespace Distance { -DIST_BENCH_REGISTER(DistanceUnexpL2Sq, MLCommon::Distance::EucUnexpandedL2); +DIST_BENCH_REGISTER(DistanceUnexpL2Sq, + ML::Distance::DistanceType::EucUnexpandedL2); DIST_BENCH_REGISTER(DistanceUnexpL2Sqrt, - MLCommon::Distance::EucUnexpandedL2Sqrt); + ML::Distance::DistanceType::EucUnexpandedL2Sqrt); } // namespace Distance } // namespace Bench 
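
The `cpp/bench/prims` hunks above are one mechanical migration repeated per file: `MLCommon::LinAlg::*` and `MLCommon::Random::Rng` move to `raft::linalg::*` and `raft::random::Rng`, and the fixtures swap `defaultDeviceAllocator` for `raft::mr::device::default_allocator`, with the call signatures otherwise unchanged. A minimal standalone sketch of the post-migration idiom follows; the `#include` targets are elided in this diff, so the header paths below are best guesses at the 0.17-era RAFT layout, while the calls themselves mirror the ones used in the benchmarks:

```cpp
// Sketch only: the raft:: idiom the prims benchmarks migrate to.
#include <memory>

#include <cuda_runtime.h>
#include <raft/cudart_utils.h>           // assumed home of CUDA_CHECK
#include <raft/linalg/add.cuh>           // assumed home of raft::linalg::add
#include <raft/mr/device/allocator.hpp>  // raft::mr::device::default_allocator
#include <raft/random/rng.cuh>           // assumed home of raft::random::Rng

int main() {
  cudaStream_t stream;
  CUDA_CHECK(cudaStreamCreate(&stream));

  // Fixtures now construct RAFT's allocator instead of defaultDeviceAllocator.
  auto alloc = std::make_shared<raft::mr::device::default_allocator>();

  const int len = 1 << 20;
  auto* in1 = static_cast<float*>(alloc->allocate(len * sizeof(float), stream));
  auto* in2 = static_cast<float*>(alloc->allocate(len * sizeof(float), stream));
  auto* out = static_cast<float*>(alloc->allocate(len * sizeof(float), stream));

  // MLCommon::Random::Rng -> raft::random::Rng, same uniform() signature.
  raft::random::Rng r(123456ULL);
  r.uniform(in1, len, -1.f, 1.f, stream);
  r.uniform(in2, len, -1.f, 1.f, stream);

  // MLCommon::LinAlg::add -> raft::linalg::add, same argument order.
  raft::linalg::add(out, in1, in2, len, stream);

  CUDA_CHECK(cudaStreamSynchronize(stream));
  alloc->deallocate(in1, len * sizeof(float), stream);
  alloc->deallocate(in2, len * sizeof(float), stream);
  alloc->deallocate(out, len * sizeof(float), stream);
  CUDA_CHECK(cudaStreamDestroy(stream));
  return 0;
}
```
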
diff --git a/cpp/bench/prims/fused_l2_nn.cu b/cpp/bench/prims/fused_l2_nn.cu index fcb2608c85..60a2a6cfaa 100644 --- a/cpp/bench/prims/fused_l2_nn.cu +++ b/cpp/bench/prims/fused_l2_nn.cu @@ -14,11 +14,11 @@ * limitations under the License. */ -#include +#include #include #include -#include -#include +#include +#include #include "../common/ml_benchmark.hpp" namespace MLCommon { @@ -32,8 +32,8 @@ struct FLNParams { template struct FusedL2NN : public Fixture { FusedL2NN(const std::string& name, const FLNParams& p) - : Fixture(name, - std::shared_ptr(new defaultDeviceAllocator)), + : Fixture(name, std::shared_ptr( + new raft::mr::device::default_allocator)), params(p) {} protected: @@ -44,14 +44,14 @@ struct FusedL2NN : public Fixture { alloc(yn, params.n); alloc(out, params.m); alloc(workspace, params.m); - MLCommon::Random::Rng r(123456ULL); + raft::random::Rng r(123456ULL); r.uniform(x, params.m * params.k, T(-1.0), T(1.0), stream); r.uniform(y, params.n * params.k, T(-1.0), T(1.0), stream); - MLCommon::LinAlg::rowNorm(xn, x, params.k, params.m, - MLCommon::LinAlg::L2Norm, true, stream); - MLCommon::LinAlg::rowNorm(yn, y, params.k, params.n, - MLCommon::LinAlg::L2Norm, true, stream); - auto blks = ceildiv(params.m, 256); + raft::linalg::rowNorm(xn, x, params.k, params.m, raft::linalg::L2Norm, true, + stream); + raft::linalg::rowNorm(yn, y, params.k, params.n, raft::linalg::L2Norm, true, + stream); + auto blks = raft::ceildiv(params.m, 256); MLCommon::Distance::initKernel, int> <<>>(out, params.m, std::numeric_limits::max(), op); diff --git a/cpp/bench/prims/gram_matrix.cu b/cpp/bench/prims/gram_matrix.cu index 5961a5ee38..f8cbf664c6 100644 --- a/cpp/bench/prims/gram_matrix.cu +++ b/cpp/bench/prims/gram_matrix.cu @@ -18,7 +18,7 @@ #include #include #include -#include +#include #include #include #include @@ -40,8 +40,8 @@ struct GramTestParams { template struct GramMatrix : public Fixture { GramMatrix(const std::string& name, const GramTestParams& p) - : Fixture(name, - std::shared_ptr(new defaultDeviceAllocator)), + : Fixture(name, std::shared_ptr( + new raft::mr::device::default_allocator)), params(p) { std::vector kernel_names{"linear", "poly", "rbf", "tanh"}; std::ostringstream oss; @@ -61,7 +61,7 @@ struct GramMatrix : public Fixture { alloc(A, params.m * params.k); alloc(B, params.k * params.n); alloc(C, params.m * params.n); - MLCommon::Random::Rng r(123456ULL); + raft::random::Rng r(123456ULL); r.uniform(A, params.m * params.k, T(-1.0), T(1.0), stream); r.uniform(B, params.k * params.n, T(-1.0), T(1.0), stream); } diff --git a/cpp/bench/prims/make_blobs.cu b/cpp/bench/prims/make_blobs.cu index ca7eff42f2..0e9f65a5f3 100644 --- a/cpp/bench/prims/make_blobs.cu +++ b/cpp/bench/prims/make_blobs.cu @@ -29,8 +29,8 @@ struct Params { template struct MakeBlobs : public Fixture { MakeBlobs(const std::string& name, const Params& p) - : Fixture(name, - std::shared_ptr(new defaultDeviceAllocator)), + : Fixture(name, std::shared_ptr( + new raft::mr::device::default_allocator)), params(p) {} protected: diff --git a/cpp/bench/prims/map_then_reduce.cu b/cpp/bench/prims/map_then_reduce.cu index d5c757d003..2bd8bf2501 100644 --- a/cpp/bench/prims/map_then_reduce.cu +++ b/cpp/bench/prims/map_then_reduce.cu @@ -14,7 +14,7 @@ * limitations under the License. 
*/ -#include +#include #include "../common/ml_benchmark.hpp" namespace MLCommon { @@ -33,8 +33,8 @@ struct Identity { template struct MapThenReduce : public Fixture { MapThenReduce(const std::string& name, const Params& p) - : Fixture(name, - std::shared_ptr(new defaultDeviceAllocator)), + : Fixture(name, std::shared_ptr( + new raft::mr::device::default_allocator)), params(p) {} protected: @@ -50,8 +50,8 @@ struct MapThenReduce : public Fixture { void runBenchmark(::benchmark::State& state) override { loopOnState(state, [this]() { - MLCommon::LinAlg::mapThenSumReduce(out, params.len, Identity(), stream, - in); + raft::linalg::mapThenSumReduce(out, params.len, Identity(), stream, + in); }); } diff --git a/cpp/bench/prims/matrix_vector_op.cu b/cpp/bench/prims/matrix_vector_op.cu index 62b1ebaa76..4dd7a3ea75 100644 --- a/cpp/bench/prims/matrix_vector_op.cu +++ b/cpp/bench/prims/matrix_vector_op.cu @@ -14,7 +14,7 @@ * limitations under the License. */ -#include +#include #include "../common/ml_benchmark.hpp" namespace MLCommon { @@ -29,8 +29,8 @@ struct Params { template struct MatVecOp : public Fixture { MatVecOp(const std::string& name, const Params& p) - : Fixture(name, - std::shared_ptr(new defaultDeviceAllocator)), + : Fixture(name, std::shared_ptr( + new raft::mr::device::default_allocator)), params(p) {} protected: @@ -50,9 +50,9 @@ struct MatVecOp : public Fixture { void runBenchmark(::benchmark::State& state) override { loopOnState(state, [this]() { - MLCommon::LinAlg::matrixVectorOp(out, in, vec, params.cols, params.rows, - params.rowMajor, params.bcastAlongRows, - Sum(), stream); + raft::linalg::matrixVectorOp(out, in, vec, params.cols, params.rows, + params.rowMajor, params.bcastAlongRows, + raft::Sum(), stream); }); } diff --git a/cpp/bench/prims/permute.cu b/cpp/bench/prims/permute.cu index 1b54d6e8cb..8d3b8f1157 100644 --- a/cpp/bench/prims/permute.cu +++ b/cpp/bench/prims/permute.cu @@ -14,9 +14,9 @@ * limitations under the License. */ -#include +#include +#include #include -#include #include "../common/ml_benchmark.hpp" namespace MLCommon { @@ -31,8 +31,8 @@ struct Params { template struct Permute : public Fixture { Permute(const std::string& name, const Params& p) - : Fixture(name, - std::shared_ptr(new defaultDeviceAllocator)), + : Fixture(name, std::shared_ptr( + new raft::mr::device::default_allocator)), params(p) {} protected: @@ -44,7 +44,7 @@ struct Permute : public Fixture { } else { perms = nullptr; } - MLCommon::Random::Rng r(123456ULL); + raft::random::Rng r(123456ULL); if (params.needShuffle) { alloc(out, matLen); alloc(in, matLen); @@ -67,7 +67,7 @@ struct Permute : public Fixture { } void runBenchmark(::benchmark::State& state) override { - MLCommon::Random::Rng r(123456ULL); + raft::random::Rng r(123456ULL); loopOnState(state, [this, &r]() { MLCommon::Random::permute(perms, out, in, params.cols, params.rows, params.rowMajor, stream); diff --git a/cpp/bench/prims/reduce.cu b/cpp/bench/prims/reduce.cu index cfcb193ffb..0ed557ab71 100644 --- a/cpp/bench/prims/reduce.cu +++ b/cpp/bench/prims/reduce.cu @@ -14,7 +14,7 @@ * limitations under the License. 
*/ -#include +#include #include "../common/ml_benchmark.hpp" namespace MLCommon { @@ -29,8 +29,8 @@ struct Params { template struct Reduce : public Fixture { Reduce(const std::string& name, const Params& p) - : Fixture(name, - std::shared_ptr(new defaultDeviceAllocator)), + : Fixture(name, std::shared_ptr( + new raft::mr::device::default_allocator)), params(p) {} protected: @@ -46,8 +46,8 @@ struct Reduce : public Fixture { void runBenchmark(::benchmark::State& state) override { loopOnState(state, [this]() { - MLCommon::LinAlg::reduce(dots, data, params.cols, params.rows, T(0.f), - true, params.alongRows, stream); + raft::linalg::reduce(dots, data, params.cols, params.rows, T(0.f), true, + params.alongRows, stream); }); } diff --git a/cpp/bench/prims/rng.cu b/cpp/bench/prims/rng.cu index 934794bd80..af1281eb0e 100644 --- a/cpp/bench/prims/rng.cu +++ b/cpp/bench/prims/rng.cu @@ -14,8 +14,8 @@ * limitations under the License. */ -#include -#include +#include +#include #include "../common/ml_benchmark.hpp" namespace MLCommon { @@ -38,15 +38,15 @@ template struct Params { int len; RandomType type; - MLCommon::Random::GeneratorType gtype; + raft::random::GeneratorType gtype; T start, end; }; // struct Params template struct RngBench : public Fixture { RngBench(const std::string& name, const Params& p) - : Fixture(name, - std::shared_ptr(new defaultDeviceAllocator)), + : Fixture(name, std::shared_ptr( + new raft::mr::device::default_allocator)), params(p) {} protected: @@ -59,7 +59,7 @@ struct RngBench : public Fixture { } void runBenchmark(::benchmark::State& state) override { - MLCommon::Random::Rng r(123456ULL, params.gtype); + raft::random::Rng r(123456ULL, params.gtype); loopOnState(state, [this, &r]() { switch (params.type) { case RNG_Normal: @@ -100,7 +100,7 @@ struct RngBench : public Fixture { template static std::vector> getInputs() { - using namespace MLCommon::Random; + using namespace raft::random; return { {1024 * 1024, RNG_Uniform, GenPhilox, T(-1.0), T(1.0)}, {32 * 1024 * 1024, RNG_Uniform, GenPhilox, T(-1.0), T(1.0)}, diff --git a/cpp/bench/sg/arima_loglikelihood.cu b/cpp/bench/sg/arima_loglikelihood.cu index f3befac912..d13e859e49 100644 --- a/cpp/bench/sg/arima_loglikelihood.cu +++ b/cpp/bench/sg/arima_loglikelihood.cu @@ -22,8 +22,9 @@ #include #include -#include +#include +#include #include "benchmark.cuh" namespace ML { @@ -46,13 +47,12 @@ class ArimaLoglikelihood : public TsFixtureRandom { using MLCommon::Bench::CudaEventTimer; auto& handle = *this->handle; - auto stream = handle.getStream(); + auto stream = handle.get_stream(); auto counting = thrust::make_counting_iterator(0); // Generate random parameters int N = order.complexity(); - MLCommon::Random::Rng gpu_gen(this->params.seed, - MLCommon::Random::GenPhilox); + raft::random::Rng gpu_gen(this->params.seed, raft::random::GenPhilox); gpu_gen.uniform(param, N * this->params.batch_size, -1.0, 1.0, stream); // Set sigma2 parameters to 1.0 DataT* x = param; // copy the object attribute for thrust @@ -75,8 +75,8 @@ class ArimaLoglikelihood : public TsFixtureRandom { Fixture::allocateBuffers(state); auto& handle = *this->handle; - auto stream = handle.getStream(); - auto allocator = handle.getDeviceAllocator(); + auto stream = handle.get_stream(); + auto allocator = handle.get_device_allocator(); // Buffer for the model parameters param = (DataT*)allocator->allocate( @@ -86,28 +86,24 @@ class ArimaLoglikelihood : public TsFixtureRandom { loglike = (DataT*)allocator->allocate( this->params.batch_size * sizeof(DataT), 
stream); residual = (DataT*)allocator->allocate( - this->params.batch_size * (this->params.n_obs - order.lost_in_diff()) * - sizeof(DataT), - stream); + this->params.batch_size * this->params.n_obs * sizeof(DataT), stream); } void deallocateBuffers(const ::benchmark::State& state) { Fixture::deallocateBuffers(state); auto& handle = *this->handle; - auto stream = handle.getStream(); - auto allocator = handle.getDeviceAllocator(); + auto stream = handle.get_stream(); + auto allocator = handle.get_device_allocator(); allocator->deallocate( param, order.complexity() * this->params.batch_size * sizeof(DataT), stream); allocator->deallocate(loglike, this->params.batch_size * sizeof(DataT), stream); - allocator->deallocate(residual, - this->params.batch_size * - (this->params.n_obs - order.lost_in_diff()) * - sizeof(DataT), - stream); + allocator->deallocate( + residual, this->params.batch_size * this->params.n_obs * sizeof(DataT), + stream); } protected: diff --git a/cpp/bench/sg/benchmark.cuh b/cpp/bench/sg/benchmark.cuh index 2e448a0a51..2669b79019 100644 --- a/cpp/bench/sg/benchmark.cuh +++ b/cpp/bench/sg/benchmark.cuh @@ -17,8 +17,8 @@ #pragma once #include -#include #include +#include #include #include #include "../common/ml_benchmark.hpp" @@ -32,15 +32,16 @@ namespace Bench { class Fixture : public MLCommon::Bench::Fixture { public: Fixture(const std::string& name) - : MLCommon::Bench::Fixture( - name, std::shared_ptr(new defaultDeviceAllocator)) {} + : MLCommon::Bench::Fixture(name, + std::shared_ptr( + new raft::mr::device::default_allocator)) {} Fixture() = delete; void SetUp(const ::benchmark::State& state) override { - handle.reset(new cumlHandle(NumStreams)); - d_alloc = handle->getDeviceAllocator(); + handle.reset(new raft::handle_t(NumStreams)); + d_alloc = handle->get_device_allocator(); MLCommon::Bench::Fixture::SetUp(state); - handle->setStream(stream); + handle->set_stream(stream); } void TearDown(const ::benchmark::State& state) override { @@ -82,7 +83,7 @@ class Fixture : public MLCommon::Bench::Fixture { generateMetrics(state); } - std::unique_ptr handle; + std::unique_ptr handle; ///@todo: ideally, this should be determined at runtime based on the inputs /// passed to the fixture. 
That will require a whole lot of plumbing of diff --git a/cpp/bench/sg/dataset.cuh b/cpp/bench/sg/dataset.cuh index 1cb72bef53..ce9d243a85 100644 --- a/cpp/bench/sg/dataset.cuh +++ b/cpp/bench/sg/dataset.cuh @@ -16,15 +16,15 @@ #pragma once -#include -#include +#include +#include #include -#include #include #include #include #include -#include +#include +#include #include #include #include @@ -81,17 +81,17 @@ struct Dataset { L* y; /** allocate space needed for the dataset */ - void allocate(const cumlHandle& handle, const DatasetParams& p) { - auto allocator = handle.getDeviceAllocator(); - auto stream = handle.getStream(); + void allocate(const raft::handle_t& handle, const DatasetParams& p) { + auto allocator = handle.get_device_allocator(); + auto stream = handle.get_stream(); X = (D*)allocator->allocate(p.nrows * p.ncols * sizeof(D), stream); y = (L*)allocator->allocate(p.nrows * sizeof(L), stream); } /** free-up the buffers */ - void deallocate(const cumlHandle& handle, const DatasetParams& p) { - auto allocator = handle.getDeviceAllocator(); - auto stream = handle.getStream(); + void deallocate(const raft::handle_t& handle, const DatasetParams& p) { + auto allocator = handle.get_device_allocator(); + auto stream = handle.get_stream(); allocator->deallocate(X, p.nrows * p.ncols * sizeof(D), stream); allocator->deallocate(y, p.nrows * sizeof(L), stream); } @@ -103,12 +103,12 @@ struct Dataset { * Generate random blobs data. Args are the same as in make_blobs. * Assumes that the user has already called `allocate` */ - void blobs(const cumlHandle& handle, const DatasetParams& p, + void blobs(const raft::handle_t& handle, const DatasetParams& p, const BlobsParams& b) { - const auto& handle_impl = handle.getImpl(); - auto stream = handle_impl.getStream(); - auto cublas_handle = handle_impl.getCublasHandle(); - auto allocator = handle_impl.getDeviceAllocator(); + const auto& handle_impl = handle; + auto stream = handle_impl.get_stream(); + auto cublas_handle = handle_impl.get_cublas_handle(); + auto allocator = handle_impl.get_device_allocator(); // Make blobs will generate labels of type IdxT which has to be an integer // type. We cast it to a different output type if needed. @@ -124,7 +124,7 @@ struct Dataset { b.shuffle, D(b.center_box_min), D(b.center_box_max), b.seed); if (!std::is_same::value) { - MLCommon::LinAlg::unaryOp( + raft::linalg::unaryOp( y, tmpY, p.nrows, [] __device__(IdxT z) { return (L)z; }, stream); allocator->deallocate(tmpY, p.nrows * sizeof(IdxT), stream); } @@ -134,15 +134,15 @@ struct Dataset { * Generate random regression data. Args are the same as in make_regression. 
* Assumes that the user has already called `allocate` */ - void regression(const cumlHandle& handle, const DatasetParams& p, + void regression(const raft::handle_t& handle, const DatasetParams& p, const RegressionParams& r) { ASSERT(!isClassification(), "make_regression: is only for regression problems!"); - const auto& handle_impl = handle.getImpl(); - auto stream = handle_impl.getStream(); - auto cublas_handle = handle_impl.getCublasHandle(); - auto cusolver_handle = handle_impl.getcusolverDnHandle(); - auto allocator = handle_impl.getDeviceAllocator(); + const auto& handle_impl = handle; + auto stream = handle_impl.get_stream(); + auto cublas_handle = handle_impl.get_cublas_handle(); + auto cusolver_handle = handle_impl.get_cusolver_dn_handle(); + auto allocator = handle_impl.get_device_allocator(); D* tmpX = X; @@ -150,12 +150,11 @@ struct Dataset { tmpX = (D*)allocator->allocate(p.nrows * p.ncols * sizeof(D), stream); } MLCommon::Random::make_regression( - tmpX, y, p.nrows, p.ncols, r.n_informative, cublas_handle, - cusolver_handle, allocator, stream, (D*)nullptr, 1, D(r.bias), - r.effective_rank, D(r.tail_strength), D(r.noise), r.shuffle, r.seed); + handle, tmpX, y, p.nrows, p.ncols, r.n_informative, stream, (D*)nullptr, + 1, D(r.bias), r.effective_rank, D(r.tail_strength), D(r.noise), r.shuffle, + r.seed); if (!p.rowMajor) { - MLCommon::LinAlg::transpose(tmpX, X, p.nrows, p.ncols, cublas_handle, - stream); + raft::linalg::transpose(handle, tmpX, X, p.nrows, p.ncols, stream); allocator->deallocate(tmpX, p.nrows * p.ncols * sizeof(D), stream); } } @@ -173,7 +172,7 @@ struct Dataset { * std::vector& y, int lineNum, const DatasetParams& p);` */ template - void read_csv(const cumlHandle& handle, const std::string& csvfile, + void read_csv(const raft::handle_t& handle, const std::string& csvfile, const DatasetParams& p, Lambda readOp) { if (isClassification() && p.nclasses <= 0) { ASSERT(false, @@ -192,9 +191,9 @@ struct Dataset { counter++; } myfile.close(); - auto stream = handle.getStream(); - MLCommon::copy(X, &(_X[0]), p.nrows * p.ncols, stream); - MLCommon::copy(y, &(_y[0]), p.nrows, stream); + auto stream = handle.get_stream(); + raft::copy(X, &(_X[0]), p.nrows * p.ncols, stream); + raft::copy(y, &(_y[0]), p.nrows, stream); } private: diff --git a/cpp/bench/sg/dataset_ts.cuh b/cpp/bench/sg/dataset_ts.cuh index 686e45e697..b43029d22b 100644 --- a/cpp/bench/sg/dataset_ts.cuh +++ b/cpp/bench/sg/dataset_ts.cuh @@ -17,10 +17,11 @@ #pragma once #include -#include #include +#include -#include +#include +#include namespace ML { namespace Bench { @@ -42,25 +43,25 @@ struct TimeSeriesDataset { DataT* X; /** allocate space needed for the dataset */ - void allocate(const cumlHandle& handle, const TimeSeriesParams& p) { - auto allocator = handle.getDeviceAllocator(); - auto stream = handle.getStream(); + void allocate(const raft::handle_t& handle, const TimeSeriesParams& p) { + auto allocator = handle.get_device_allocator(); + auto stream = handle.get_stream(); X = (DataT*)allocator->allocate(p.batch_size * p.n_obs * sizeof(DataT), stream); } /** free-up the buffers */ - void deallocate(const cumlHandle& handle, const TimeSeriesParams& p) { - auto allocator = handle.getDeviceAllocator(); - auto stream = handle.getStream(); + void deallocate(const raft::handle_t& handle, const TimeSeriesParams& p) { + auto allocator = handle.get_device_allocator(); + auto stream = handle.get_stream(); allocator->deallocate(X, p.batch_size * p.n_obs * sizeof(DataT), stream); } /** generate random time series 
(normal distribution) */ - void random(const cumlHandle& handle, const TimeSeriesParams& p, DataT mu = 0, - DataT sigma = 1) { - MLCommon::Random::Rng gpu_gen(p.seed, MLCommon::Random::GenPhilox); - gpu_gen.normal(X, p.batch_size * p.n_obs, mu, sigma, handle.getStream()); + void random(const raft::handle_t& handle, const TimeSeriesParams& p, + DataT mu = 0, DataT sigma = 1) { + raft::random::Rng gpu_gen(p.seed, raft::random::GenPhilox); + gpu_gen.normal(X, p.batch_size * p.n_obs, mu, sigma, handle.get_stream()); } }; diff --git a/cpp/bench/sg/fil.cu b/cpp/bench/sg/fil.cu new file mode 100644 index 0000000000..094e735c1a --- /dev/null +++ b/cpp/bench/sg/fil.cu @@ -0,0 +1,194 @@ +/* + * Copyright (c) 2019-2020, NVIDIA CORPORATION. + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include "benchmark.cuh" + +namespace ML { +namespace Bench { +namespace fil { + +struct Params { + DatasetParams data; + RegressionParams blobs; + ModelHandle model; + ML::fil::storage_type_t storage; + ML::fil::algo_t algo; + RF_params rf; + int predict_repetitions; +}; + +class FIL : public RegressionFixture { + typedef RegressionFixture Base; + + public: + FIL(const std::string& name, const Params& p) + /* + fitting to linear combinations in "y" normally yields trees that check + values of all significant columns, as well as their linear + combinations in "X". During inference, the exact threshold + values do not affect speed. The distribution of column popularity does + not affect speed barring lots of uninformative columns in succession. + Hence, this method represents real datasets well enough for both + classification and regression. + */ + : RegressionFixture(name, p.data, p.blobs), + model(p.model), + p_rest(p) {} + + static void regression_to_classification(float* y, int nrows, int nclasses, + cudaStream_t stream) { + raft::linalg::unaryOp( + y, y, nrows, + [=] __device__(float a) { + return float(lroundf(fabsf(a) * 1000. * nclasses) % nclasses); + }, + stream); + } + + protected: + void runBenchmark(::benchmark::State& state) override { + if (!params.rowMajor) { + state.SkipWithError("FIL only supports row-major inputs"); + } + if (params.nclasses > 1) { + // convert regression ranges into [0..nclasses-1] + regression_to_classification(data.y, params.nrows, params.nclasses, + stream); + } + // create model + ML::RandomForestRegressorF rf_model; + auto* mPtr = &rf_model; + mPtr->trees = nullptr; + size_t train_nrows = std::min(params.nrows, 1000); + fit(*handle, mPtr, data.X, train_nrows, params.ncols, data.y, p_rest.rf); + CUDA_CHECK(cudaStreamSynchronize(stream)); + + ML::build_treelite_forest(&model, &rf_model, params.ncols, + params.nclasses > 1 ? 
2 : 1); + ML::fil::treelite_params_t tl_params = { + .algo = p_rest.algo, + .output_class = params.nclasses > 1, // cuML RF forest + .threshold = 1.f / params.nclasses, //Fixture::DatasetParams + .storage_type = p_rest.storage}; + ML::fil::from_treelite(*handle, &forest, model, &tl_params); + + // only time prediction + this->loopOnState(state, [this]() { + // Dataset allocates y assuming one output value per input row, + // so not supporting predict_proba yet + for (int i = 0; i < p_rest.predict_repetitions; i++) { + ML::fil::predict(*this->handle, this->forest, this->data.y, + this->data.X, this->params.nrows, false); + } + }); + } + + void allocateBuffers(const ::benchmark::State& state) override { + Base::allocateBuffers(state); + } + + void deallocateBuffers(const ::benchmark::State& state) override { + ML::fil::free(*handle, forest); + Base::deallocateBuffers(state); + } + + private: + ML::fil::forest_t forest; + ModelHandle model; + Params p_rest; +}; + +struct FilBenchParams { + int nrows; + int ncols; + int nclasses; + int max_depth; + int ntrees; + ML::fil::storage_type_t storage; + ML::fil::algo_t algo; +}; + +std::vector getInputs() { + std::vector out; + Params p; + p.data.rowMajor = true; + p.blobs = { + .n_informative = -1, // Just a placeholder value, anyway changed below + .effective_rank = -1, // Just a placeholder value, anyway changed below + .bias = 0.f, + .tail_strength = 0.1, + .noise = 0.01, + .shuffle = false, + .seed = 12345ULL}; + + set_rf_params(p.rf, // Output RF parameters + 1, // n_trees, just a placeholder value, anyway changed below + true, // bootstrap + 1.f, // rows_sample + 1234, // seed + 8); // n_streams + + set_tree_params(p.rf.tree_params, // Output tree parameters + 10, // max_depth, just a placeholder value, + // anyway changed below + (1 << 20), // max_leaves + 1, // max_features + 32, // n_bins + 1, // split_algo + 3, // min_rows_per_node + 0.0f, // min_impurity_decrease + true, // bootstrap_features + ML::CRITERION::MSE, // split_criterion + false, // quantile_per_tree + false, // use_experimental_backend + 128); // max_batch_size + + using ML::fil::algo_t; + using ML::fil::storage_type_t; + std::vector var_params = { + {(int)1e6, 20, 1, 5, 1000, storage_type_t::DENSE, algo_t::BATCH_TREE_REORG}, + {(int)1e6, 20, 2, 5, 1000, storage_type_t::DENSE, + algo_t::BATCH_TREE_REORG}}; + for (auto& i : var_params) { + p.data.nrows = i.nrows; + p.data.ncols = i.ncols; + p.blobs.n_informative = i.ncols / 3; + p.blobs.effective_rank = i.ncols / 3; + p.data.nclasses = i.nclasses; + p.rf.tree_params.max_depth = i.max_depth; + p.rf.n_trees = i.ntrees; + p.storage = i.storage; + p.algo = i.algo; + p.predict_repetitions = 10; + out.push_back(p); + } + return out; +} + +ML_BENCH_REGISTER(Params, FIL, "", getInputs()); + +} // end namespace fil +} // end namespace Bench +} // end namespace ML diff --git a/cpp/bench/sg/rf_classifier.cu b/cpp/bench/sg/rf_classifier.cu index 54859d5f15..ec1f95cba0 100644 --- a/cpp/bench/sg/rf_classifier.cu +++ b/cpp/bench/sg/rf_classifier.cu @@ -77,22 +77,34 @@ std::vector getInputs() { std::vector out; Params p; p.data.rowMajor = false; - p.blobs.cluster_std = 10.0; - p.blobs.shuffle = false; - p.blobs.center_box_min = -10.0; - p.blobs.center_box_max = 10.0; - p.blobs.seed = 12345ULL; - p.rf.bootstrap = true; - p.rf.rows_sample = 1.f; - p.rf.tree_params.max_leaves = 1 << 20; - p.rf.tree_params.min_rows_per_node = 3; - p.rf.tree_params.n_bins = 32; - p.rf.tree_params.bootstrap_features = true; - p.rf.tree_params.quantile_per_tree = 
false; - p.rf.tree_params.split_algo = 1; - p.rf.tree_params.split_criterion = (ML::CRITERION)0; - p.rf.n_trees = 500; - p.rf.n_streams = 8; + p.blobs = {10.0, // cluster_std + false, // shuffle + -10.0, // center_box_min + 10.0, // center_box_max + 2152953ULL}; //seed + + set_rf_params(p.rf, // Output RF parameters + 500, // n_trees + true, // bootstrap + 1.f, // rows_sample + 1234, // seed + 8); // n_streams + + set_tree_params(p.rf.tree_params, // Output tree parameters + 10, // max_depth, this is anyway changed below + (1 << 20), // max_leaves + 0.3, // max_features, just a placeholder value, + // anyway changed below + 32, // n_bins + 1, // split_algo + 3, // min_rows_per_node + 0.0f, // min_impurity_decrease + true, // bootstrap_features + ML::CRITERION::GINI, // split_criterion + false, // quantile_per_tree + false, // use_experimental_backend + 128); // max_batch_size + std::vector rowcols = { {160000, 64, 2}, {640000, 64, 8}, @@ -105,7 +117,7 @@ std::vector getInputs() { p.data.ncols = rc.ncols; p.data.nclasses = rc.nclasses; p.rf.tree_params.max_features = 1.f / std::sqrt(float(rc.ncols)); - for (auto max_depth : std::vector({8, 10})) { + for (auto max_depth : std::vector({7, 9})) { p.rf.tree_params.max_depth = max_depth; out.push_back(p); } diff --git a/cpp/bench/sg/rf_regressor.cu b/cpp/bench/sg/rf_regressor.cu index 8235ce382c..1ed292a089 100644 --- a/cpp/bench/sg/rf_regressor.cu +++ b/cpp/bench/sg/rf_regressor.cu @@ -77,23 +77,37 @@ std::vector getInputs() { struct std::vector out; RegParams p; p.data.rowMajor = false; - p.regression.shuffle = true; // better to shuffle when n_informative < ncols - p.regression.seed = 12345ULL; - p.regression.effective_rank = -1; // dataset generation will be faster - p.regression.bias = 4.5; - p.regression.tail_strength = 0.5; // unused when effective_rank = -1 - p.regression.noise = 1.; - p.rf.bootstrap = true; - p.rf.rows_sample = 1.f; - p.rf.tree_params.max_leaves = 1 << 20; - p.rf.tree_params.min_rows_per_node = 3; - p.rf.tree_params.n_bins = 32; - p.rf.tree_params.bootstrap_features = true; - p.rf.tree_params.quantile_per_tree = false; - p.rf.tree_params.split_algo = 1; - p.rf.tree_params.split_criterion = ML::CRITERION::MSE; - p.rf.n_trees = 500; - p.rf.n_streams = 8; + p.regression = { + .shuffle = true, // Better to shuffle when n_informative < ncols + .effective_rank = -1, // dataset generation will be faster + .bias = 4.5, + .tail_strength = 0.5, // unused when effective_rank = -1 + .noise = 1.0, + .seed = 12345ULL}; + + set_rf_params(p.rf, // Output RF parameters + 500, // n_trees + true, // bootstrap + 1.f, // rows_sample + 1234, // seed + 8); // n_streams + + set_tree_params(p.rf.tree_params, // Output tree parameters + 10, // max_depth, just a place holder value, + // anyway changed below + (1 << 20), // max_leaves + 0.3, // max_features, just a place holder value, + // anyway changed below + 32, // n_bins + 1, // split_algo + 3, // min_rows_per_node + 0.0f, // min_impurity_decrease + true, // bootstrap_features + ML::CRITERION::MSE, // split_criterion + false, // quantile_per_tree + false, // use_experimental_backend + 128); // max_batch_size + std::vector dim_info = {{500000, 500, 400}}; for (auto& di : dim_info) { // Let's run Bosch only for float type @@ -102,7 +116,7 @@ std::vector getInputs() { p.data.ncols = di.ncols; p.regression.n_informative = di.n_informative; p.rf.tree_params.max_features = 1.f; - for (auto max_depth : std::vector({8, 12, 16})) { + for (auto max_depth : std::vector({7, 11, 15})) { 
p.rf.tree_params.max_depth = max_depth; out.push_back(p); } diff --git a/cpp/bench/sg/umap.cu b/cpp/bench/sg/umap.cu index e4395b9268..d7ddb31552 100644 --- a/cpp/bench/sg/umap.cu +++ b/cpp/bench/sg/umap.cu @@ -14,9 +14,9 @@ * limitations under the License. */ -#include #include #include +#include #include #include "benchmark.cuh" @@ -40,7 +40,7 @@ __global__ void castKernel(OutT* out, const InT* in, IdxT len) { template void cast(OutT* out, const InT* in, IdxT len, cudaStream_t stream) { static const int TPB = 256; - auto nblks = MLCommon::ceildiv(len, TPB); + auto nblks = raft::ceildiv(len, TPB); castKernel<<>>(out, in, len); CUDA_CHECK(cudaGetLastError()); } diff --git a/cpp/cmake/Dependencies.cmake b/cpp/cmake/Dependencies.cmake index 6fcbf742c6..ffa014641b 100644 --- a/cpp/cmake/Dependencies.cmake +++ b/cpp/cmake/Dependencies.cmake @@ -39,7 +39,7 @@ else(DEFINED ENV{RAFT_PATH}) ExternalProject_Add(raft GIT_REPOSITORY https://github.com/rapidsai/raft.git - GIT_TAG b6ef2a825bfcd47aa46d634a46049da791b43fa0 + GIT_TAG 9b3afe67895fbea397fb2c72375157aadfc132d8 PREFIX ${RAFT_DIR} CONFIGURE_COMMAND "" BUILD_COMMAND "" @@ -53,7 +53,7 @@ endif(DEFINED ENV{RAFT_PATH}) ############################################################################## # - cumlprims (binary dependency) -------------------------------------------- -if(NOT DISABLE_CUMLPRIMS_MG) +if(ENABLE_CUMLPRIMS_MG) if(DEFINED ENV{CUMLPRIMS_MG_PATH}) set(CUMLPRIMS_MG_PATH ENV{CUMLPRIMS_MG_PATH}}) @@ -74,31 +74,47 @@ if(NOT DISABLE_CUMLPRIMS_MG) endif(EXISTS "${CUMLPRIMS_MG_PATH}/lib/libcumlprims.so") endif(NOT CUMLPRIMS_MG_PATH) -endif(NOT DISABLE_CUMLPRIMS_MG) +endif(ENABLE_CUMLPRIMS_MG) ############################################################################## # - RMM ---------------------------------------------------------------------- -# find package module uses RMM_INSTALL_DIR for Hints, checking RMM_ROOT env variable -# to match other RAPIDS repos. 
-set(RMM_INSTALL_DIR ENV{RMM_ROOT}) +find_path(RMM_INCLUDE_DIRS "rmm" + HINTS + "$ENV{RMM_ROOT}/include" + "$ENV{CONDA_PREFIX}/include/rmm" + "$ENV{CONDA_PREFIX}/include") -find_package(RMM - REQUIRED) +message(STATUS "RMM: RMM_INCLUDE_DIRS set to ${RMM_INCLUDE_DIRS}") +############################################################################## +# - NCCL --------------------------------------------------------------------- + +if(BUILD_CUML_MPI_COMMS OR BUILD_CUML_STD_COMMS) + find_package(NCCL REQUIRED) +endif(BUILD_CUML_MPI_COMMS OR BUILD_CUML_STD_COMMS) + +############################################################################## +# - MPI --------------------------------------------------------------------- + +if(BUILD_CUML_MPI_COMMS) + find_package(MPI REQUIRED) +endif(BUILD_CUML_MPI_COMMS) ############################################################################## # - cub - (header only) ------------------------------------------------------ -set(CUB_DIR ${CMAKE_CURRENT_BINARY_DIR}/cub CACHE STRING "Path to cub repo") -ExternalProject_Add(cub - GIT_REPOSITORY https://github.com/thrust/cub.git - GIT_TAG 1.8.0 - PREFIX ${CUB_DIR} - CONFIGURE_COMMAND "" - BUILD_COMMAND "" - INSTALL_COMMAND "") +if(NOT CUB_IS_PART_OF_CTK) + set(CUB_DIR ${CMAKE_CURRENT_BINARY_DIR}/cub CACHE STRING "Path to cub repo") + ExternalProject_Add(cub + GIT_REPOSITORY https://github.com/thrust/cub.git + GIT_TAG 1.8.0 + PREFIX ${CUB_DIR} + CONFIGURE_COMMAND "" + BUILD_COMMAND "" + INSTALL_COMMAND "") +endif(NOT CUB_IS_PART_OF_CTK) ############################################################################## # - cutlass - (header only) -------------------------------------------------- @@ -120,7 +136,7 @@ set(SPDLOG_DIR ${CMAKE_CURRENT_BINARY_DIR}/spdlog CACHE STRING "Path to spdlog install directory") ExternalProject_Add(spdlog GIT_REPOSITORY https://github.com/gabime/spdlog.git - GIT_TAG v1.x + GIT_TAG v1.7.0 PREFIX ${SPDLOG_DIR} CONFIGURE_COMMAND "" BUILD_COMMAND "" @@ -129,69 +145,85 @@ ExternalProject_Add(spdlog ############################################################################## # - faiss -------------------------------------------------------------------- -set(FAISS_DIR ${CMAKE_CURRENT_BINARY_DIR}/faiss CACHE STRING - "Path to FAISS source directory") -ExternalProject_Add(faiss - GIT_REPOSITORY https://github.com/facebookresearch/faiss.git - GIT_TAG v1.6.2 - CONFIGURE_COMMAND LIBS=-pthread - CPPFLAGS=-w - LDFLAGS=-L${CMAKE_INSTALL_PREFIX}/lib - ${CMAKE_CURRENT_BINARY_DIR}/faiss/src/faiss/configure - --prefix=${CMAKE_CURRENT_BINARY_DIR}/faiss - --with-blas=${BLAS_LIBRARIES} - --with-cuda=${CUDA_TOOLKIT_ROOT_DIR} - --with-cuda-arch=${FAISS_GPU_ARCHS} - -v - PREFIX ${FAISS_DIR} - BUILD_COMMAND make -j${PARALLEL_LEVEL} VERBOSE=1 - BUILD_BYPRODUCTS ${FAISS_DIR}/lib/libfaiss.a - INSTALL_COMMAND make -s install > /dev/null - UPDATE_COMMAND "" - BUILD_IN_SOURCE 1) - -ExternalProject_Get_Property(faiss install_dir) - -add_library(faisslib STATIC IMPORTED) - -set_property(TARGET faisslib PROPERTY - IMPORTED_LOCATION ${FAISS_DIR}/lib/libfaiss.a) +if(BUILD_STATIC_FAISS) + set(FAISS_DIR ${CMAKE_CURRENT_BINARY_DIR}/faiss CACHE STRING + "Path to FAISS source directory") + ExternalProject_Add(faiss + GIT_REPOSITORY https://github.com/facebookresearch/faiss.git + GIT_TAG a5b850dec6f1cd6c88ab467bfd5e87b0cac2e41d + CONFIGURE_COMMAND LIBS=-pthread + CPPFLAGS=-w + LDFLAGS=-L${CMAKE_INSTALL_PREFIX}/lib + ${CMAKE_CURRENT_BINARY_DIR}/faiss/src/faiss/configure + --prefix=${CMAKE_CURRENT_BINARY_DIR}/faiss + 
--with-blas=${BLAS_LIBRARIES} + --with-cuda=${CUDA_TOOLKIT_ROOT_DIR} + --with-cuda-arch=${FAISS_GPU_ARCHS} + -v + PREFIX ${FAISS_DIR} + BUILD_COMMAND make -j${PARALLEL_LEVEL} VERBOSE=1 + BUILD_BYPRODUCTS ${FAISS_DIR}/lib/libfaiss.a + BUILD_ALWAYS 1 + INSTALL_COMMAND make -s install > /dev/null + UPDATE_COMMAND "" + BUILD_IN_SOURCE 1 + PATCH_COMMAND patch -p1 -N < ${CMAKE_CURRENT_SOURCE_DIR}/cmake/faiss_cuda11.patch || true) + + ExternalProject_Get_Property(faiss install_dir) + add_library(FAISS::FAISS STATIC IMPORTED) + set_property(TARGET FAISS::FAISS PROPERTY + IMPORTED_LOCATION ${FAISS_DIR}/lib/libfaiss.a) + # to account for the FAISS file reorg that happened recently after the current + # pinned commit, just change the following line to + # set(FAISS_INCLUDE_DIRS "${FAISS_DIR}/src/faiss") + set(FAISS_INCLUDE_DIRS "${FAISS_DIR}/src") +else() + set(FAISS_INSTALL_DIR ENV{FAISS_ROOT}) + find_package(FAISS REQUIRED) +endif(BUILD_STATIC_FAISS) ############################################################################## # - treelite build ----------------------------------------------------------- -find_package(Treelite 0.92 REQUIRED) +find_package(Treelite 0.93 REQUIRED) ############################################################################## -# - googletest --------------------------------------------------------------- - -set(GTEST_DIR ${CMAKE_CURRENT_BINARY_DIR}/googletest CACHE STRING - "Path to googletest repo") -set(GTEST_BINARY_DIR ${PROJECT_BINARY_DIR}/googletest) -set(GTEST_INSTALL_DIR ${GTEST_BINARY_DIR}/install) -set(GTEST_LIB ${GTEST_INSTALL_DIR}/lib/libgtest_main.a) -include(ExternalProject) -ExternalProject_Add(googletest - GIT_REPOSITORY https://github.com/google/googletest.git - GIT_TAG 6ce9b98f541b8bcd84c5c5b3483f29a933c4aefb - PREFIX ${GTEST_DIR} - CMAKE_ARGS -DCMAKE_INSTALL_PREFIX=<INSTALL_DIR> - -DBUILD_SHARED_LIBS=OFF - -DCMAKE_INSTALL_LIBDIR=lib - BUILD_BYPRODUCTS ${GTEST_DIR}/lib/libgtest.a - ${GTEST_DIR}/lib/libgtest_main.a - UPDATE_COMMAND "") - -add_library(gtestlib STATIC IMPORTED) -add_library(gtest_mainlib STATIC IMPORTED) - -set_property(TARGET gtestlib PROPERTY - IMPORTED_LOCATION ${GTEST_DIR}/lib/libgtest.a) -set_property(TARGET gtest_mainlib PROPERTY - IMPORTED_LOCATION ${GTEST_DIR}/lib/libgtest_main.a) - -add_dependencies(gtestlib googletest) -add_dependencies(gtest_mainlib googletest) +# - googletest build ----------------------------------------------------------- + +if(BUILD_GTEST) + set(GTEST_DIR ${CMAKE_CURRENT_BINARY_DIR}/googletest CACHE STRING + "Path to googletest repo") + set(GTEST_BINARY_DIR ${PROJECT_BINARY_DIR}/googletest) + set(GTEST_INSTALL_DIR ${GTEST_BINARY_DIR}/install) + set(GTEST_LIB ${GTEST_INSTALL_DIR}/lib/libgtest_main.a) + include(ExternalProject) + ExternalProject_Add(googletest + GIT_REPOSITORY https://github.com/google/googletest.git + GIT_TAG 6ce9b98f541b8bcd84c5c5b3483f29a933c4aefb + PREFIX ${GTEST_DIR} + CMAKE_ARGS -DCMAKE_INSTALL_PREFIX=<INSTALL_DIR> + -DBUILD_SHARED_LIBS=OFF + -DCMAKE_INSTALL_LIBDIR=lib + BUILD_BYPRODUCTS ${GTEST_DIR}/lib/libgtest.a + ${GTEST_DIR}/lib/libgtest_main.a + UPDATE_COMMAND "") + + add_library(GTest::GTest STATIC IMPORTED) + add_library(GTest::Main STATIC IMPORTED) + + set_property(TARGET GTest::GTest PROPERTY + IMPORTED_LOCATION ${GTEST_DIR}/lib/libgtest.a) + set_property(TARGET GTest::Main PROPERTY + IMPORTED_LOCATION ${GTEST_DIR}/lib/libgtest_main.a) + + set(GTEST_INCLUDE_DIRS "${GTEST_DIR}") + + add_dependencies(GTest::GTest googletest) + add_dependencies(GTest::Main googletest) + +else() + find_package(GTest 
REQUIRED) +endif(BUILD_GTEST) ############################################################################## # - googlebench --------------------------------------------------------------- @@ -225,10 +257,14 @@ set_property(TARGET benchmarklib PROPERTY # TODO: Change to using build.sh and make targets instead of this -add_dependencies(cub raft) -add_dependencies(cutlass cub) +if(CUB_IS_PART_OF_CTK) + add_dependencies(cutlass raft) +else() + add_dependencies(cub raft) + add_dependencies(cutlass cub) +endif(CUB_IS_PART_OF_CTK) add_dependencies(spdlog cutlass) -add_dependencies(googletest spdlog) -add_dependencies(benchmark googletest) -add_dependencies(faiss benchmark) -add_dependencies(faisslib faiss) +add_dependencies(GTest::GTest spdlog) +add_dependencies(benchmark GTest::GTest) +add_dependencies(FAISS::FAISS benchmark) +add_dependencies(FAISS::FAISS faiss) diff --git a/cpp/cmake/doxygen.cmake b/cpp/cmake/doxygen.cmake index b27cb39290..07b2d1488a 100644 --- a/cpp/cmake/doxygen.cmake +++ b/cpp/cmake/doxygen.cmake @@ -1,4 +1,4 @@ -# Copyright (c) 2019, NVIDIA CORPORATION. +# Copyright (c) 2019-2020, NVIDIA CORPORATION. # # Licensed under the Apache License, Version 2.0 (the "License"); # you may not use this file except in compliance with the License. @@ -13,7 +13,7 @@ # limitations under the License. # -find_package(Doxygen 1.8.11) +find_package(Doxygen 1.8.12 REQUIRED) function(add_doxygen_target) if(Doxygen_FOUND) diff --git a/cpp/cmake/faiss_cuda11.patch b/cpp/cmake/faiss_cuda11.patch new file mode 100644 index 0000000000..496ca0e7b2 --- /dev/null +++ b/cpp/cmake/faiss_cuda11.patch @@ -0,0 +1,40 @@ +diff --git a/configure b/configure +index ed40dae..f88ed0a 100755 +--- a/configure ++++ b/configure +@@ -2970,7 +2970,7 @@ ac_link='$CXX -o conftest$ac_exeext $CXXFLAGS $CPPFLAGS $LDFLAGS conftest.$ac_ex + ac_compiler_gnu=$ac_cv_cxx_compiler_gnu + + +- ax_cxx_compile_alternatives="11 0x" ax_cxx_compile_cxx11_required=true ++ ax_cxx_compile_alternatives="14 11 0x" ax_cxx_compile_cxx11_required=true + ac_ext=cpp + ac_cpp='$CXXCPP $CPPFLAGS' + ac_compile='$CXX -c $CXXFLAGS $CPPFLAGS conftest.$ac_ext >&5' +diff --git a/gpu/utils/DeviceDefs.cuh b/gpu/utils/DeviceDefs.cuh +index 89d3dda..bc0f9b5 100644 +--- a/gpu/utils/DeviceDefs.cuh ++++ b/gpu/utils/DeviceDefs.cuh +@@ -13,7 +13,7 @@ + namespace faiss { namespace gpu { + + #ifdef __CUDA_ARCH__ +-#if __CUDA_ARCH__ <= 750 ++#if __CUDA_ARCH__ <= 800 + constexpr int kWarpSize = 32; + #else + #error Unknown __CUDA_ARCH__; please define parameters for compute capability +diff --git a/gpu/utils/MatrixMult-inl.cuh b/gpu/utils/MatrixMult-inl.cuh +index ede225e..4f7eb44 100644 +--- a/gpu/utils/MatrixMult-inl.cuh ++++ b/gpu/utils/MatrixMult-inl.cuh +@@ -51,6 +51,9 @@ rawGemm(cublasHandle_t handle, + auto cBT = GetCudaType::Type; + + // Always accumulate in f32 ++# if __CUDACC_VER_MAJOR__ >= 11 ++ cublasSetMathMode(handle, CUBLAS_MATH_DISALLOW_REDUCED_PRECISION_REDUCTION); ++# endif + return cublasSgemmEx(handle, transa, transb, m, n, k, + &fAlpha, A, cAT, lda, + B, cBT, ldb, diff --git a/cpp/comms/mpi/CMakeLists.txt b/cpp/comms/mpi/CMakeLists.txt deleted file mode 100644 index 5f6713b709..0000000000 --- a/cpp/comms/mpi/CMakeLists.txt +++ /dev/null @@ -1,53 +0,0 @@ -# -# Copyright (c) 2019-2020, NVIDIA CORPORATION. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. 
-# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. -# - -cmake_minimum_required(VERSION 3.14 FATAL_ERROR) -project(cuML-comms-MPI LANGUAGES CXX CUDA) - -find_package(MPI REQUIRED) - -if(NOT NCCL_PATH) - find_package(NCCL REQUIRED) -else() - set(NCCL_INCLUDE_DIRS ${NCCL_PATH}/include) - set(NCCL_LIBRARIES ${NCCL_PATH}/lib/libnccl.so) - set(NCCL_FOUND ON) -endif(NOT NCCL_PATH) - -set(CMAKE_CXX_STANDARD 14) -set(CMAKE_CXX_STANDARD_REQUIRED ON) - -include_directories(include - ${CMAKE_CUDA_TOOLKIT_INCLUDE_DIRECTORIES} - ${MPI_CXX_INCLUDE_PATH} - ../../include - ../../src - ../../src_prims -) - -set(MPI_COMMS_LINK_LIBRARIES ${CUML_CPP_TARGET} ${MPI_C_LIBRARIES}) - -if (NCCL_FOUND) - add_definitions(-DHAVE_NCCL) - include_directories( ${NCCL_INCLUDE_DIRS} ) - list(APPEND MPI_COMMS_LINK_LIBRARIES ${NCCL_LIBRARIES}) -endif() - -add_library(cumlcommsmpi SHARED src/cuML_comms_mpi_impl.cpp) -target_link_libraries(cumlcommsmpi ${MPI_COMMS_LINK_LIBRARIES}) -target_compile_options(cumlcommsmpi PUBLIC ${MPI_C_COMPILE_FLAGS}) - -install(TARGETS cumlcommsmpi DESTINATION lib) diff --git a/cpp/comms/mpi/include/cuML_comms.hpp b/cpp/comms/mpi/include/cuML_comms.hpp deleted file mode 100644 index 1df2c3f27d..0000000000 --- a/cpp/comms/mpi/include/cuML_comms.hpp +++ /dev/null @@ -1,26 +0,0 @@ -/* - * Copyright (c) 2019, NVIDIA CORPORATION. - * - * Licensed under the Apache License, Version 2.0 (the "License"); - * you may not use this file except in compliance with the License. - * You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. - */ - -#pragma once - -#include -#include - -namespace ML { - -void initialize_mpi_comms(cumlHandle& handle, MPI_Comm comm); - -} // end namespace ML diff --git a/cpp/comms/mpi/src/cuML_comms_mpi_impl.cpp b/cpp/comms/mpi/src/cuML_comms_mpi_impl.cpp deleted file mode 100644 index ad2b5ab98a..0000000000 --- a/cpp/comms/mpi/src/cuML_comms_mpi_impl.cpp +++ /dev/null @@ -1,413 +0,0 @@ -/* - * Copyright (c) 2019-2020, NVIDIA CORPORATION. - * - * Licensed under the Apache License, Version 2.0 (the "License"); - * you may not use this file except in compliance with the License. - * You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. 
- */ - -#include "cuML_comms_mpi_impl.hpp" - -#include -#include - -#include -#include -#include - -#include - -#define MPI_CHECK(call) \ - do { \ - int status = call; \ - if (MPI_SUCCESS != status) { \ - int mpi_error_string_lenght = 0; \ - char mpi_error_string[MPI_MAX_ERROR_STRING]; \ - MPI_Error_string(status, mpi_error_string, &mpi_error_string_lenght); \ - ASSERT(MPI_SUCCESS == status, "ERROR: MPI call='%s'. Reason:%s\n", \ - #call, mpi_error_string); \ - } \ - } while (0) - -#define MPI_CHECK_NO_THROW(call) \ - do { \ - int status = call; \ - if (MPI_SUCCESS != status) { \ - int mpi_error_string_lenght = 0; \ - char mpi_error_string[MPI_MAX_ERROR_STRING]; \ - MPI_Error_string(status, mpi_error_string, &mpi_error_string_lenght); \ - CUML_LOG_ERROR("MPI call='%s' at file=%s line=%d failed with %s ", \ - #call, __FILE__, __LINE__, mpi_error_string); \ - } \ - } while (0) - -#define NCCL_CHECK(call) \ - do { \ - ncclResult_t status = call; \ - ASSERT(ncclSuccess == status, "ERROR: NCCL call='%s'. Reason:%s\n", #call, \ - ncclGetErrorString(status)); \ - } while (0) - -#define NCCL_CHECK_NO_THROW(call) \ - do { \ - ncclResult_t status = call; \ - if (status != ncclSuccess) { \ - CUML_LOG_ERROR("NCCL call='%s' failed. Reason:%s\n", #call, \ - ncclGetErrorString(status)); \ - } \ - } while (0) - -namespace ML { - -namespace { -size_t getDatatypeSize(const cumlMPICommunicator_impl::datatype_t datatype) { - switch (datatype) { - case MLCommon::cumlCommunicator::CHAR: - return sizeof(char); - case MLCommon::cumlCommunicator::UINT8: - return sizeof(unsigned char); - case MLCommon::cumlCommunicator::INT: - return sizeof(int); - case MLCommon::cumlCommunicator::UINT: - return sizeof(unsigned int); - case MLCommon::cumlCommunicator::INT64: - return sizeof(long long int); - case MLCommon::cumlCommunicator::UINT64: - return sizeof(unsigned long long int); - case MLCommon::cumlCommunicator::FLOAT: - return sizeof(float); - case MLCommon::cumlCommunicator::DOUBLE: - return sizeof(double); - default: - // Execution should never reach here. This takes care of compiler warning. - return 0; - } -} - -MPI_Datatype getMPIDatatype( - const cumlMPICommunicator_impl::datatype_t datatype) { - switch (datatype) { - case MLCommon::cumlCommunicator::CHAR: - return MPI_CHAR; - case MLCommon::cumlCommunicator::UINT8: - return MPI_UNSIGNED_CHAR; - case MLCommon::cumlCommunicator::INT: - return MPI_INT; - case MLCommon::cumlCommunicator::UINT: - return MPI_UNSIGNED; - case MLCommon::cumlCommunicator::INT64: - return MPI_LONG_LONG; - case MLCommon::cumlCommunicator::UINT64: - return MPI_UNSIGNED_LONG_LONG; - case MLCommon::cumlCommunicator::FLOAT: - return MPI_FLOAT; - case MLCommon::cumlCommunicator::DOUBLE: - return MPI_DOUBLE; - default: - // Execution should never reach here. This takes care of compiler warning. - return MPI_DOUBLE; - } -} - -MPI_Op getMPIOp(const cumlMPICommunicator_impl::op_t op) { - switch (op) { - case MLCommon::cumlCommunicator::SUM: - return MPI_SUM; - case MLCommon::cumlCommunicator::PROD: - return MPI_PROD; - case MLCommon::cumlCommunicator::MIN: - return MPI_MIN; - case MLCommon::cumlCommunicator::MAX: - return MPI_MAX; - default: - // Execution should never reach here. This takes care of compiler warning. 
- return MPI_MAX; - } -} - -#ifdef HAVE_NCCL -ncclDataType_t getNCCLDatatype( - const cumlMPICommunicator_impl::datatype_t datatype) { - switch (datatype) { - case MLCommon::cumlCommunicator::CHAR: - return ncclChar; - case MLCommon::cumlCommunicator::UINT8: - return ncclUint8; - case MLCommon::cumlCommunicator::INT: - return ncclInt; - case MLCommon::cumlCommunicator::UINT: - return ncclUint32; - case MLCommon::cumlCommunicator::INT64: - return ncclInt64; - case MLCommon::cumlCommunicator::UINT64: - return ncclUint64; - case MLCommon::cumlCommunicator::FLOAT: - return ncclFloat; - case MLCommon::cumlCommunicator::DOUBLE: - return ncclDouble; - default: - // Execution should never reach here. This takes care of compiler warning. - return ncclDouble; - } -} - -ncclRedOp_t getNCCLOp(const cumlMPICommunicator_impl::op_t op) { - switch (op) { - case MLCommon::cumlCommunicator::SUM: - return ncclSum; - case MLCommon::cumlCommunicator::PROD: - return ncclProd; - case MLCommon::cumlCommunicator::MIN: - return ncclMin; - case MLCommon::cumlCommunicator::MAX: - return ncclMax; - default: - // Execution should never reach here. This takes care of compiler warning. - return ncclMax; - } -} -#endif -} // namespace - -void initialize_mpi_comms(cumlHandle& handle, MPI_Comm comm) { - auto communicator = std::make_shared( - std::unique_ptr( - new cumlMPICommunicator_impl(comm))); - handle.getImpl().setCommunicator(communicator); -} - -cumlMPICommunicator_impl::cumlMPICommunicator_impl(MPI_Comm comm, - const bool owns_mpi_comm) - : _owns_mpi_comm(owns_mpi_comm), - _mpi_comm(comm), - _size(0), - _rank(1), - _next_request_id(0) { - int mpi_is_initialized = 0; - MPI_CHECK(MPI_Initialized(&mpi_is_initialized)); - ASSERT(mpi_is_initialized, "ERROR: MPI is not initialized!"); - MPI_CHECK(MPI_Comm_size(_mpi_comm, &_size)); - MPI_CHECK(MPI_Comm_rank(_mpi_comm, &_rank)); -#ifdef HAVE_NCCL - //get NCCL unique ID at rank 0 and broadcast it to all others - ncclUniqueId id; - if (0 == _rank) NCCL_CHECK(ncclGetUniqueId(&id)); - MPI_CHECK(MPI_Bcast((void*)&id, sizeof(id), MPI_BYTE, 0, _mpi_comm)); - - //initializing NCCL - NCCL_CHECK(ncclCommInitRank(&_nccl_comm, _size, id, _rank)); -#endif -} - -cumlMPICommunicator_impl::~cumlMPICommunicator_impl() { -#ifdef HAVE_NCCL - //finalizing NCCL - NCCL_CHECK_NO_THROW(ncclCommDestroy(_nccl_comm)); -#endif - if (_owns_mpi_comm) { - MPI_CHECK_NO_THROW(MPI_Comm_free(&_mpi_comm)); - } -} - -int cumlMPICommunicator_impl::getSize() const { return _size; } - -int cumlMPICommunicator_impl::getRank() const { return _rank; } - -std::unique_ptr -cumlMPICommunicator_impl::commSplit(int color, int key) const { - MPI_Comm new_comm; - MPI_CHECK(MPI_Comm_split(_mpi_comm, color, key, &new_comm)); - return std::unique_ptr( - new cumlMPICommunicator_impl(new_comm, true)); -} - -void cumlMPICommunicator_impl::barrier() const { - MPI_CHECK(MPI_Barrier(_mpi_comm)); -} - -void cumlMPICommunicator_impl::isend(const void* buf, int size, int dest, - int tag, request_t* request) const { - MPI_Request mpi_req; - request_t req_id; - if (_free_requests.empty()) { - req_id = _next_request_id++; - } else { - auto it = _free_requests.begin(); - req_id = *it; - _free_requests.erase(it); - } - MPI_CHECK(MPI_Isend(buf, size, MPI_BYTE, dest, tag, _mpi_comm, &mpi_req)); - _requests_in_flight.insert(std::make_pair(req_id, mpi_req)); - *request = req_id; -} - -void cumlMPICommunicator_impl::irecv(void* buf, int size, int source, int tag, - request_t* request) const { - if (source == CUML_ANY_SOURCE) source = 
MPI_ANY_SOURCE; - - MPI_Request mpi_req; - request_t req_id; - if (_free_requests.empty()) { - req_id = _next_request_id++; - } else { - auto it = _free_requests.begin(); - req_id = *it; - _free_requests.erase(it); - } - - MPI_CHECK(MPI_Irecv(buf, size, MPI_BYTE, source, tag, _mpi_comm, &mpi_req)); - _requests_in_flight.insert(std::make_pair(req_id, mpi_req)); - *request = req_id; -} - -void cumlMPICommunicator_impl::waitall(int count, - request_t array_of_requests[]) const { - std::vector requests; - requests.reserve(count); - for (int i = 0; i < count; ++i) { - auto req_it = _requests_in_flight.find(array_of_requests[i]); - ASSERT(_requests_in_flight.end() != req_it, - "ERROR: waitall on invalid request: %d", array_of_requests[i]); - requests.push_back(req_it->second); - _free_requests.insert(req_it->first); - _requests_in_flight.erase(req_it); - } - MPI_CHECK(MPI_Waitall(requests.size(), requests.data(), MPI_STATUSES_IGNORE)); -} - -void cumlMPICommunicator_impl::allreduce(const void* sendbuff, void* recvbuff, - int count, datatype_t datatype, - op_t op, cudaStream_t stream) const { -#ifdef HAVE_NCCL - NCCL_CHECK(ncclAllReduce(sendbuff, recvbuff, count, getNCCLDatatype(datatype), - getNCCLOp(op), _nccl_comm, stream)); -#else - CUDA_CHECK(cudaStreamSynchronize(stream)); - MPI_CHECK(MPI_Allreduce(sendbuff, recvbuff, count, getMPIDatatype(datatype), - getMPIOp(op), _mpi_comm)); -#endif -} - -void cumlMPICommunicator_impl::bcast(void* buff, int count, datatype_t datatype, - int root, cudaStream_t stream) const { -#ifdef HAVE_NCCL - NCCL_CHECK(ncclBroadcast(buff, buff, count, getNCCLDatatype(datatype), root, - _nccl_comm, stream)); -#else - CUDA_CHECK(cudaStreamSynchronize(stream)); - MPI_CHECK(MPI_Bcast(buff, count, getMPIDatatype(datatype), root, _mpi_comm)); -#endif -} - -void cumlMPICommunicator_impl::reduce(const void* sendbuff, void* recvbuff, - int count, datatype_t datatype, op_t op, - int root, cudaStream_t stream) const { -#ifdef HAVE_NCCL - NCCL_CHECK(ncclReduce(sendbuff, recvbuff, count, getNCCLDatatype(datatype), - getNCCLOp(op), root, _nccl_comm, stream)); -#else - CUDA_CHECK(cudaStreamSynchronize(stream)); - MPI_CHECK(MPI_Reduce(sendbuff, recvbuff, count, getMPIDatatype(datatype), - getMPIOp(op), root, _mpi_comm)); -#endif -} - -void cumlMPICommunicator_impl::allgather(const void* sendbuff, void* recvbuff, - int sendcount, datatype_t datatype, - cudaStream_t stream) const { -#ifdef HAVE_NCCL - NCCL_CHECK(ncclAllGather(sendbuff, recvbuff, sendcount, - getNCCLDatatype(datatype), _nccl_comm, stream)); -#else - CUDA_CHECK(cudaStreamSynchronize(stream)); - MPI_CHECK(MPI_Allgather(sendbuff, sendcount, getMPIDatatype(datatype), - recvbuff, sendcount, getMPIDatatype(datatype), - _mpi_comm)); -#endif -} - -void cumlMPICommunicator_impl::allgatherv(const void* sendbuf, void* recvbuf, - const int recvcounts[], - const int displs[], - datatype_t datatype, - cudaStream_t stream) const { -#ifdef HAVE_NCCL - //From: "An Empirical Evaluation of Allgatherv on Multi-GPU Systems" - https://arxiv.org/pdf/1812.05964.pdf - //Listing 1 on page 4. 
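The comment above cites the per-root broadcast trick that the ncclBroadcast loop just below relies on. A minimal sketch of the same idea in plain host-side MPI, with illustrative buffer names rather than the cuML API:

```cpp
#include <mpi.h>

#include <cstring>
#include <vector>

// Emulate MPI_Allgatherv with one broadcast per root: every rank first
// places its own contribution at its displacement, then each rank in turn
// broadcasts its slot to all peers.
void allgatherv_by_bcast(const int* sendbuf, int* recvbuf,
                         const std::vector<int>& recvcounts,
                         const std::vector<int>& displs, MPI_Comm comm) {
  int rank, size;
  MPI_Comm_rank(comm, &rank);
  MPI_Comm_size(comm, &size);
  // MPI_Bcast reads and writes the same buffer, so copy our slot in first.
  std::memcpy(recvbuf + displs[rank], sendbuf,
              recvcounts[rank] * sizeof(int));
  for (int root = 0; root < size; ++root) {
    MPI_Bcast(recvbuf + displs[root], recvcounts[root], MPI_INT, root, comm);
  }
}
```

The NCCL variant that follows avoids the initial copy because ncclBroadcast takes separate send and receive buffers; only the root's sendbuf is read on each iteration.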
- for (int root = 0; root < _size; ++root) { - NCCL_CHECK(ncclBroadcast( - sendbuf, - static_cast(recvbuf) + displs[root] * getDatatypeSize(datatype), - recvcounts[root], getNCCLDatatype(datatype), root, _nccl_comm, stream)); - } -#else - CUDA_CHECK(cudaStreamSynchronize(stream)); - MPI_CHECK(MPI_Allgatherv(sendbuf, recvcounts[_rank], getMPIDatatype(datatype), - recvbuf, recvcounts, displs, - getMPIDatatype(datatype), _mpi_comm)); -#endif -} - -void cumlMPICommunicator_impl::reducescatter(const void* sendbuff, - void* recvbuff, int recvcount, - datatype_t datatype, op_t op, - cudaStream_t stream) const { -#ifdef HAVE_NCCL - NCCL_CHECK(ncclReduceScatter(sendbuff, recvbuff, recvcount, - getNCCLDatatype(datatype), getNCCLOp(op), - _nccl_comm, stream)); -#else - CUDA_CHECK(cudaStreamSynchronize(stream)); - std::vector recvcounts(_size, recvcount); - MPI_CHECK(MPI_Reduce_scatter(sendbuff, recvbuff, recvcounts.data(), - getMPIDatatype(datatype), getMPIOp(op), - _mpi_comm)); -#endif -} - -MLCommon::cumlCommunicator::status_t cumlMPICommunicator_impl::syncStream( - cudaStream_t stream) const { -#ifdef HAVE_NCCL - cudaError_t cudaErr; - ncclResult_t ncclErr, ncclAsyncErr; - while (1) { - cudaErr = cudaStreamQuery(stream); - if (cudaErr == cudaSuccess) return status_t::commStatusSuccess; - - if (cudaErr != cudaErrorNotReady) { - // An error occurred querying the status of the stream - return status_t::commStatusError; - } - - ncclErr = ncclCommGetAsyncError(_nccl_comm, &ncclAsyncErr); - if (ncclErr != ncclSuccess) { - // An error occurred retrieving the asynchronous error - return status_t::commStatusError; - } - - if (ncclAsyncErr != ncclSuccess) { - // An asynchronous error happened. Stop the operation and destroy - // the communicator - ncclErr = ncclCommAbort(_nccl_comm); - if (ncclErr != ncclSuccess) - // Caller may abort with an exception or try to re-create a new communicator. - return status_t::commStatusAbort; - } - - // Let other threads (including NCCL threads) use the CPU. - pthread_yield(); - } -#else - CUDA_CHECK(cudaStreamSynchronize(stream)); - return status_t::commStatusSuccess; -#endif -} -} // end namespace ML diff --git a/cpp/comms/mpi/src/cuML_comms_mpi_impl.hpp b/cpp/comms/mpi/src/cuML_comms_mpi_impl.hpp deleted file mode 100644 index 165d2df8c6..0000000000 --- a/cpp/comms/mpi/src/cuML_comms_mpi_impl.hpp +++ /dev/null @@ -1,94 +0,0 @@ -/* - * Copyright (c) 2019, NVIDIA CORPORATION. - * - * Licensed under the Apache License, Version 2.0 (the "License"); - * you may not use this file except in compliance with the License. - * You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. 
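The isend/irecv/waitall plumbing above hands out integer request ids and recycles them through a free set, so the map of in-flight requests stays compact. A self-contained sketch of that bookkeeping pattern, with illustrative names (the real code maps ids to MPI_Request or ucp_request handles):

```cpp
#include <cassert>
#include <unordered_map>
#include <unordered_set>

struct RequestPool {
  using request_t = unsigned int;
  request_t next_id = 0;
  std::unordered_set<request_t> free_ids;
  std::unordered_map<request_t, void*> in_flight;  // id -> backend handle

  // Called by isend/irecv: reuse a freed id if possible, else mint one.
  request_t acquire(void* backend_handle) {
    request_t id;
    if (free_ids.empty()) {
      id = next_id++;
    } else {
      auto it = free_ids.begin();
      id = *it;
      free_ids.erase(it);
    }
    in_flight.emplace(id, backend_handle);
    return id;
  }

  // Called by waitall: retire the id and hand back the backend handle.
  void* release(request_t id) {
    auto it = in_flight.find(id);
    assert(it != in_flight.end() && "waitall on invalid request");
    void* handle = it->second;
    free_ids.insert(id);
    in_flight.erase(it);
    return handle;
  }
};
```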
- */ - -#pragma once - -#include -#include -#include - -#include - -#ifdef HAVE_NCCL -#include -#endif - -#include - -namespace ML { - -class cumlMPICommunicator_impl : public MLCommon::cumlCommunicator_iface { - public: - cumlMPICommunicator_impl() = delete; - - cumlMPICommunicator_impl(MPI_Comm comm, const bool owns_mpi_comm = false); - - virtual ~cumlMPICommunicator_impl(); - - virtual int getSize() const; - virtual int getRank() const; - - virtual std::unique_ptr commSplit( - int color, int key) const; - - virtual void barrier() const; - - virtual void isend(const void* buf, int size, int dest, int tag, - request_t* request) const; - - virtual void irecv(void* buf, int size, int source, int tag, - request_t* request) const; - - virtual void waitall(int count, request_t array_of_requests[]) const; - - virtual void allreduce(const void* sendbuff, void* recvbuff, int count, - datatype_t datatype, op_t op, - cudaStream_t stream) const; - - virtual void bcast(void* buff, int count, datatype_t datatype, int root, - cudaStream_t stream) const; - - virtual void reduce(const void* sendbuff, void* recvbuff, int count, - datatype_t datatype, op_t op, int root, - cudaStream_t stream) const; - - virtual void allgather(const void* sendbuff, void* recvbuff, int sendcount, - datatype_t datatype, cudaStream_t stream) const; - - virtual void allgatherv(const void* sendbuf, void* recvbuf, - const int recvcounts[], const int displs[], - datatype_t datatype, cudaStream_t stream) const; - - virtual void reducescatter(const void* sendbuff, void* recvbuff, - int recvcount, datatype_t datatype, op_t op, - cudaStream_t stream) const; - - virtual status_t syncStream(cudaStream_t stream) const; - - private: - bool _owns_mpi_comm; - MPI_Comm _mpi_comm; -#ifdef HAVE_NCCL - ncclComm_t _nccl_comm; -#endif - int _size; - int _rank; - mutable request_t _next_request_id; - mutable std::unordered_map _requests_in_flight; - mutable std::unordered_set _free_requests; -}; - -} // end namespace ML diff --git a/cpp/comms/std/CMakeLists.txt b/cpp/comms/std/CMakeLists.txt deleted file mode 100644 index 16891178c7..0000000000 --- a/cpp/comms/std/CMakeLists.txt +++ /dev/null @@ -1,60 +0,0 @@ -# -# Copyright (c) 2019-2020, NVIDIA CORPORATION. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. 
-# - -cmake_minimum_required(VERSION 3.14 FATAL_ERROR) -project(cuML-comms LANGUAGES CXX CUDA) - -set(CMAKE_MODULE_PATH ${CMAKE_MODULE_PATH} "${CUML_DIR}/cmake") - -option(WITH_UCX "Uses UCX for P2P comms" ON) - -if(NOT NCCL_PATH) - find_package(NCCL REQUIRED) -else() - message("-- Manually set NCCL PATH to ${NCCL_PATH}") - set(NCCL_INCLUDE_DIRS ${NCCL_PATH}/include) - set(NCCL_LIBRARIES ${NCCL_PATH}/lib/libnccl.so) -endif(NOT NCCL_PATH) - -set(CMAKE_CXX_STANDARD 14) -set(CMAKE_CXX_STANDARD_REQUIRED ON) - -include_directories(include - ${CMAKE_CUDA_TOOLKIT_INCLUDE_DIRECTORIES} - ../../include - ../../src - ../../src_prims -) - -set(COMMS_LINK_LIBRARIES ${CUML_CPP_TARGET}) - -# Note this option will be removed once UCX conda package is released -if(WITH_UCX) - # dlopen is used to dynamically load the needed ucp symbols at runtime. - # Only the UCX include directories are needed for compiling - find_package(UCX) - include_directories(${UCX_INCLUDE_DIRS}) - add_compile_definitions(WITH_UCX=1) -endif(WITH_UCX) - -add_definitions(-DHAVE_NCCL) -include_directories( ${NCCL_INCLUDE_DIRS} ) -list(APPEND COMMS_LINK_LIBRARIES ${NCCL_LIBRARIES}) - -add_library(cumlcomms SHARED src/cuML_std_comms_impl.cpp) -target_link_libraries(cumlcomms ${COMMS_LINK_LIBRARIES}) - -install(TARGETS cumlcomms DESTINATION lib) diff --git a/cpp/comms/std/include/cuML_comms.hpp b/cpp/comms/std/include/cuML_comms.hpp deleted file mode 100644 index 61b9623f18..0000000000 --- a/cpp/comms/std/include/cuML_comms.hpp +++ /dev/null @@ -1,56 +0,0 @@ -/* - * Copyright (c) 2019, NVIDIA CORPORATION. - * - * Licensed under the Apache License, Version 2.0 (the "License"); - * you may not use this file except in compliance with the License. - * You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. - */ - -#pragma once - -#include - -#ifdef WITH_UCX -#include -#endif - -#include - -namespace ML { - -#ifdef WITH_UCX -/** - * @brief Given initialized comms handles for NCCL and UCP, this function builds a - * cumlCommunicator object and injects it into the given cumlHandle instance. - * @param handle the cuml handle to inject a new communicator instance into - * @param comm initialized nccl communicator - * @param ucp_worker the ucp_worker for the current initialized ucp context - * @param eps an array of endpoints to the other ucp workers in the cluster - * @param size the size of the cluster (number of elements in eps) - * @param rank rank of the current worker - */ -void inject_comms(cumlHandle &handle, ncclComm_t comm, ucp_worker_h ucp_worker, - ucp_ep_h *eps, int size, int rank); -#endif - -/** - * @brief Given an initialized comms handle for NCCL, this function builds a - * cumlCommunicator object and injects it into the given cumlHandle instance. - * The underlying cumlCommunicator will only have support for collective - * communications functions. 
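As a usage sketch for the collective-only overload described above: NCCL is bootstrapped entirely outside cuML, and only the finished communicator is handed to the handle. The include paths and the single-rank bootstrap are assumptions for illustration, not a documented recipe:

```cpp
#include <cuML.hpp>        // assumed header for ML::cumlHandle at this version
#include <cuML_comms.hpp>  // the header being removed here
#include <nccl.h>

int main() {
  // Bootstrap a trivial single-rank NCCL clique outside of cuML.
  ncclUniqueId id;
  ncclGetUniqueId(&id);
  ncclComm_t comm;
  ncclCommInitRank(&comm, /*nranks=*/1, id, /*rank=*/0);

  // Hand the ready communicator to cuML; on this code path the handle
  // gains collective support only, no point-to-point messaging.
  ML::cumlHandle handle;
  ML::inject_comms(handle, comm, /*size=*/1, /*rank=*/0);

  // ... run multi-GPU-capable algorithms through `handle` ...
  ncclCommDestroy(comm);
  return 0;
}
```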
- * @param handle the cuml handle to inject a new communicator instance into - * @param comm initialized nccl communicator - * @param size the size of the cluster - * @param rank rank of the current worker - */ -void inject_comms(cumlHandle &handle, ncclComm_t comm, int size, int rank); - -} // end namespace ML diff --git a/cpp/comms/std/src/cuML_comms_py.hpp b/cpp/comms/std/src/cuML_comms_py.hpp deleted file mode 100644 index 5267c7cb15..0000000000 --- a/cpp/comms/std/src/cuML_comms_py.hpp +++ /dev/null @@ -1,73 +0,0 @@ -/* - * Copyright (c) 2019-2020, NVIDIA CORPORATION. - * - * Licensed under the Apache License, Version 2.0 (the "License"); - * you may not use this file except in compliance with the License. - * You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. - */ - -#pragma once - -#include - -#include - -namespace ML { - -bool ucx_enabled(); - -/** - * @brief This function wraps the inject comms functions in - * cpp/comms/std/include/cuML_comms.hpp to decouple the Python - * layer from the optional UCX dependency in the C++ build. This - * allows the Cython to compile without having to propagate the `WITH_UCX` - * directive to that layer. - * @param handle the cuml handle to inject a new communicator instance into - * @param comm initialized nccl communicator - * @param ucp_worker: ucp_worker_h instance for the current initialized ucp context - * @param eps an array of ucp_ep_h endpoints to the other ucp workers in the cluster - * @param size the size of the cluster (number of elements in eps) - * @param rank rank of the current worker - */ -void inject_comms_py(cumlHandle *handle, ncclComm_t comm, void *ucp_worker, - void *eps, int size, int rank); - -/** - * @brief This function follows the design of the wrapper function in - * cpp/comms/std/include/cuML_comms.hpp to decouple the Python layer - * injection functions from the C++ layer functions. - * @param handle the cuml handle to inject a new communicator instance into - * @param comm initialized nccl communicator - * @param size the size of the cluster (number of elements in eps) - * @param rank rank of the current worker - */ -void inject_comms_py_coll(cumlHandle *handle, ncclComm_t comm, int size, - int rank); - -/** - * @brief Stores the given character array on the given ncclUniqueId struct. - * @param id the ncclUniqueId struct instance to store the given character array - * @param uniqueId the unique id char array to store on the ncclUniqueId - * @param size id size - */ -void ncclUniqueIdFromChar(ncclUniqueId *id, char *uniqueId, int size); - -/** - * @brief Returns a NCCL unique ID as a character array. PyTorch - * uses this same approach, so that it can be more easily - * converted to a native Python string by Cython and further - * serialized to be sent across process & node boundaries. - * - * @param uid nccl unique id for establishing a new clique. 
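Concretely, the uid round-trip these helpers document amounts to copying NCCL's fixed-size opaque byte blob in and out of a char array. A minimal sketch using only public NCCL definitions:

```cpp
#include <nccl.h>

#include <cstring>

int main() {
  // Generate a clique id and flatten it to bytes (what get_unique_id does)...
  ncclUniqueId id;
  ncclGetUniqueId(&id);
  char wire[NCCL_UNIQUE_ID_BYTES];
  std::memcpy(wire, id.internal, sizeof(wire));

  // ...ship `wire` across process/node boundaries as an ordinary string,
  // then restore it verbatim (what ncclUniqueIdFromChar does on arrival).
  ncclUniqueId restored;
  std::memcpy(restored.internal, wire, sizeof(wire));
  return 0;
}
```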
- * @param size uid size - */ -void get_unique_id(char *uid, int size); -} // namespace ML diff --git a/cpp/comms/std/src/cuML_std_comms_impl.cpp b/cpp/comms/std/src/cuML_std_comms_impl.cpp deleted file mode 100644 index dab51f3129..0000000000 --- a/cpp/comms/std/src/cuML_std_comms_impl.cpp +++ /dev/null @@ -1,498 +0,0 @@ -/* - * Copyright (c) 2019-2020, NVIDIA CORPORATION. - * - * Licensed under the Apache License, Version 2.0 (the "License"); - * you may not use this file except in compliance with the License. - * You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. - */ - -#include "cuML_std_comms_impl.hpp" - -#include - -#ifdef WITH_UCX -constexpr bool UCX_ENABLED = true; -#else -constexpr bool UCX_ENABLED = false; -#endif - -#ifdef WITH_UCX -#include -#include -#include "ucp_helper.h" -#endif - -#include -#include -#include -#include -#include -#include -#include -#include -#include - -#include - -#include - -#include - -#include - -#define NCCL_CHECK(call) \ - do { \ - ncclResult_t status = call; \ - ASSERT(ncclSuccess == status, "ERROR: NCCL call='%s'. Reason:%s\n", #call, \ - ncclGetErrorString(status)); \ - } while (0) - -#define NCCL_CHECK_NO_THROW(call) \ - do { \ - ncclResult_t status = call; \ - if (status != ncclSuccess) { \ - CUML_LOG_ERROR("NCCL call='%s' failed. Reason:%s\n", #call, \ - ncclGetErrorString(status)); \ - } \ - } while (0) - -namespace ML { - -namespace { - -size_t getDatatypeSize(const cumlStdCommunicator_impl::datatype_t datatype) { - switch (datatype) { - case MLCommon::cumlCommunicator::CHAR: - return sizeof(char); - case MLCommon::cumlCommunicator::UINT8: - return sizeof(uint8_t); - case MLCommon::cumlCommunicator::INT: - return sizeof(int); - case MLCommon::cumlCommunicator::UINT: - return sizeof(unsigned int); - case MLCommon::cumlCommunicator::INT64: - return sizeof(int64_t); - case MLCommon::cumlCommunicator::UINT64: - return sizeof(uint64_t); - case MLCommon::cumlCommunicator::FLOAT: - return sizeof(float); - case MLCommon::cumlCommunicator::DOUBLE: - return sizeof(double); - } -} - -ncclDataType_t getNCCLDatatype( - const cumlStdCommunicator_impl::datatype_t datatype) { - switch (datatype) { - case MLCommon::cumlCommunicator::CHAR: - return ncclChar; - case MLCommon::cumlCommunicator::UINT8: - return ncclUint8; - case MLCommon::cumlCommunicator::INT: - return ncclInt; - case MLCommon::cumlCommunicator::UINT: - return ncclUint32; - case MLCommon::cumlCommunicator::INT64: - return ncclInt64; - case MLCommon::cumlCommunicator::UINT64: - return ncclUint64; - case MLCommon::cumlCommunicator::FLOAT: - return ncclFloat; - case MLCommon::cumlCommunicator::DOUBLE: - return ncclDouble; - } -} - -ncclRedOp_t getNCCLOp(const cumlStdCommunicator_impl::op_t op) { - switch (op) { - case MLCommon::cumlCommunicator::SUM: - return ncclSum; - case MLCommon::cumlCommunicator::PROD: - return ncclProd; - case MLCommon::cumlCommunicator::MIN: - return ncclMin; - case MLCommon::cumlCommunicator::MAX: - return ncclMax; - } -} -} // namespace - -bool ucx_enabled() { return UCX_ENABLED; } - -/** - * @brief Underlying comms, like NCCL and UCX, should be initialized and ready for use, - * and maintained, outside of 
the cuML Comms lifecycle. This allows us to decouple the - * ownership of the actual comms from cuml so that they can also be used outside of cuml. - * - * For instance, nccl-py can be used to bootstrap a ncclComm_t before it is - * used to construct a cuml comms instance. UCX endpoints can be bootstrapped - * in Python using ucx-py, before being used to construct a cuML comms instance. - */ -#ifdef WITH_UCX -void inject_comms(cumlHandle &handle, ncclComm_t comm, ucp_worker_h ucp_worker, - std::shared_ptr eps, int size, int rank) { - auto communicator = std::make_shared( - std::unique_ptr( - new cumlStdCommunicator_impl(comm, ucp_worker, eps, size, rank))); - handle.getImpl().setCommunicator(communicator); -} -#endif - -void inject_comms(cumlHandle &handle, ncclComm_t comm, int size, int rank) { - auto communicator = std::make_shared( - std::unique_ptr( - new cumlStdCommunicator_impl(comm, size, rank))); - handle.getImpl().setCommunicator(communicator); -} - -void inject_comms_py_coll(cumlHandle *handle, ncclComm_t comm, int size, - int rank) { - inject_comms(*handle, comm, size, rank); -} - -void inject_comms_py(ML::cumlHandle *handle, ncclComm_t comm, void *ucp_worker, - void *eps, int size, int rank) { -#ifdef WITH_UCX - std::shared_ptr eps_sp = - std::make_shared(new ucp_ep_h[size]); - - size_t *size_t_ep_arr = (size_t *)eps; - - for (int i = 0; i < size; i++) { - size_t ptr = size_t_ep_arr[i]; - ucp_ep_h *ucp_ep_v = (ucp_ep_h *)*eps_sp; - - if (ptr != 0) { - ucp_ep_h eps_ptr = (ucp_ep_h)size_t_ep_arr[i]; - ucp_ep_v[i] = eps_ptr; - } else { - ucp_ep_v[i] = nullptr; - } - } - - inject_comms(*handle, comm, (ucp_worker_h)ucp_worker, eps_sp, size, rank); -#else - inject_comms(*handle, comm, size, rank); -#endif -} - -void ncclUniqueIdFromChar(ncclUniqueId *id, char *uniqueId, int size) { - memcpy(id->internal, uniqueId, size); -} - -void get_unique_id(char *uid, int size) { - ncclUniqueId id; - ncclGetUniqueId(&id); - - memcpy(uid, id.internal, size); -} - -#ifdef WITH_UCX -cumlStdCommunicator_impl::cumlStdCommunicator_impl( - ncclComm_t comm, ucp_worker_h ucp_worker, std::shared_ptr eps, - int size, int rank) - : _nccl_comm(comm), - _ucp_worker(ucp_worker), - _ucp_eps(eps), - _size(size), - _rank(rank), - _next_request_id(0) { - initialize(); - p2p_enabled = true; -} -#endif - -cumlStdCommunicator_impl::cumlStdCommunicator_impl(ncclComm_t comm, int size, - int rank) - : _nccl_comm(comm), _size(size), _rank(rank) { - initialize(); -} - -void cumlStdCommunicator_impl::initialize() { - CUDA_CHECK(cudaStreamCreate(&_stream)); - - CUDA_CHECK(cudaMalloc(&_sendbuff, sizeof(int))); - CUDA_CHECK(cudaMalloc(&_recvbuff, sizeof(int))); -} - -cumlStdCommunicator_impl::~cumlStdCommunicator_impl() { - CUDA_CHECK_NO_THROW(cudaStreamDestroy(_stream)); - - CUDA_CHECK_NO_THROW(cudaFree(_sendbuff)); - CUDA_CHECK_NO_THROW(cudaFree(_recvbuff)); -} - -int cumlStdCommunicator_impl::getSize() const { return _size; } - -int cumlStdCommunicator_impl::getRank() const { return _rank; } - -std::unique_ptr -cumlStdCommunicator_impl::commSplit(int color, int key) const { - // Not supported by NCCL - ASSERT(false, - "ERROR: commSplit called but not yet supported in this comms " - "implementation."); -} - -void cumlStdCommunicator_impl::barrier() const { - CUDA_CHECK(cudaMemsetAsync(_sendbuff, 1, sizeof(int), _stream)); - CUDA_CHECK(cudaMemsetAsync(_recvbuff, 1, sizeof(int), _stream)); - - allreduce(_sendbuff, _recvbuff, 1, MLCommon::cumlCommunicator::INT, - MLCommon::cumlCommunicator::SUM, _stream); - - 
ASSERT(syncStream(_stream) == status_t::commStatusSuccess, - "ERROR: syncStream failed. This can be caused by a failed rank."); -} - -void cumlStdCommunicator_impl::get_request_id(request_t *req) const { -#ifdef WITH_UCX - - request_t req_id; - - if (this->_free_requests.empty()) - req_id = this->_next_request_id++; - else { - auto it = this->_free_requests.begin(); - req_id = *it; - this->_free_requests.erase(it); - } - *req = req_id; -#endif -} - -void cumlStdCommunicator_impl::isend(const void *buf, int size, int dest, - int tag, request_t *request) const { - ASSERT(UCX_ENABLED, "cuML Comms not built with UCX support"); - ASSERT(p2p_enabled, - "cuML Comms instance was not initialized for point-to-point"); - -#ifdef WITH_UCX - ASSERT(_ucp_worker != nullptr, - "ERROR: UCX comms not initialized on communicator."); - - get_request_id(request); - ucp_ep_h ep_ptr = (*_ucp_eps)[dest]; - - ucp_request *ucp_req = (ucp_request *)malloc(sizeof(ucp_request)); - - this->_ucp_handler.ucp_isend(ucp_req, ep_ptr, buf, size, tag, - default_tag_mask, getRank()); - - CUML_LOG_DEBUG( - "%d: Created send request [id=%llu], ptr=%llu, to=%llu, ep=%llu", getRank(), - (unsigned long long)*request, (unsigned long long)ucp_req->req, - (unsigned long long)dest, (unsigned long long)ep_ptr); - - _requests_in_flight.insert(std::make_pair(*request, ucp_req)); -#endif -} - -void cumlStdCommunicator_impl::irecv(void *buf, int size, int source, int tag, - request_t *request) const { - ASSERT(UCX_ENABLED, "cuML Comms not built with UCX support"); - ASSERT(p2p_enabled, - "cuML Comms instance was not initialized for point-to-point"); - -#ifdef WITH_UCX - ASSERT(_ucp_worker != nullptr, - "ERROR: UCX comms not initialized on communicator."); - - get_request_id(request); - - ucp_ep_h ep_ptr = (*_ucp_eps)[source]; - - ucp_tag_t tag_mask = default_tag_mask; - - if (source == CUML_ANY_SOURCE) { - tag_mask = any_rank_tag_mask; - } - - ucp_request *ucp_req = (ucp_request *)malloc(sizeof(ucp_request)); - _ucp_handler.ucp_irecv(ucp_req, _ucp_worker, ep_ptr, buf, size, tag, tag_mask, - source); - - CUML_LOG_DEBUG( - "%d: Created receive request [id=%llu], ptr=%llu, from=%llu, ep=%llu", - getRank(), (unsigned long long)*request, (unsigned long long)ucp_req->req, - (unsigned long long)source, (unsigned long long)ep_ptr); - - _requests_in_flight.insert(std::make_pair(*request, ucp_req)); -#endif -} - -void cumlStdCommunicator_impl::waitall(int count, - request_t array_of_requests[]) const { - ASSERT(UCX_ENABLED, "cuML Comms not built with UCX support"); - ASSERT(p2p_enabled, - "cuML Comms instance was not initialized for point-to-point"); - -#ifdef WITH_UCX - ASSERT(_ucp_worker != nullptr, - "ERROR: UCX comms not initialized on communicator."); - - std::vector requests; - requests.reserve(count); - - time_t start = time(NULL); - - for (int i = 0; i < count; ++i) { - auto req_it = _requests_in_flight.find(array_of_requests[i]); - ASSERT(_requests_in_flight.end() != req_it, - "ERROR: waitall on invalid request: %d", array_of_requests[i]); - requests.push_back(req_it->second); - _free_requests.insert(req_it->first); - _requests_in_flight.erase(req_it); - } - - while (requests.size() > 0) { - time_t now = time(NULL); - - // Timeout if we have not gotten progress or completed any requests - // in 10 or more seconds. 
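Note that the timeout in waitall() here is measured from the last observed progress, not from entry, so long transfers that keep moving are never aborted. A distilled, backend-free sketch of that reset-on-progress loop, where a boolean completion test stands in for ucp_worker_progress plus the completed flag:

```cpp
#include <chrono>
#include <functional>
#include <stdexcept>
#include <vector>

// Wait for a set of requests, aborting only if *no* request completes
// for 10 consecutive seconds; any completion resets the deadline.
void wait_all(std::vector<std::function<bool()>> pending) {
  using clock = std::chrono::steady_clock;
  auto last_progress = clock::now();
  while (!pending.empty()) {
    if (clock::now() - last_progress >= std::chrono::seconds(10)) {
      throw std::runtime_error("Timed out waiting for requests.");
    }
    for (auto it = pending.begin(); it != pending.end();) {
      if ((*it)()) {                    // poll: true means completed
        it = pending.erase(it);
        last_progress = clock::now();   // progress resets the timeout
      } else {
        ++it;
      }
    }
  }
}
```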
- ASSERT(now - start < 10, "Timed out waiting for requests."); - - for (std::vector::iterator it = requests.begin(); - it != requests.end();) { - bool restart = false; // resets the timeout when any progress was made - - // Causes UCP to progress through the send/recv message queue - while (_ucp_handler.ucp_progress(_ucp_worker) != 0) { - restart = true; - } - - auto req = *it; - - // If the message needs release, we know it will be sent/received - // asynchronously, so we will need to track and verify its state - if (req->needs_release) { - ASSERT(UCS_PTR_IS_PTR(req->req), - "UCX Request Error. Request is not valid UCX pointer"); - ASSERT(!UCS_PTR_IS_ERR(req->req), "UCX Request Error: %d\n", - UCS_PTR_STATUS(req->req)); - ASSERT(req->req->completed == 1 || req->req->completed == 0, - "request->completed not a valid value: %d\n", - req->req->completed); - } - - // If a message was sent synchronously (eg. completed before - // `isend`/`irecv` completed) or an asynchronous message - // is complete, we can go ahead and clean it up. - if (!req->needs_release || req->req->completed == 1) { - restart = true; - CUML_LOG_DEBUG( - "%d: request completed. [ptr=%llu, num_left=%lu," - " other_rank=%d, is_send=%d, completed_immediately=%d]", - getRank(), (unsigned long long)req->req, requests.size() - 1, - req->other_rank, req->is_send_request, !req->needs_release); - - // perform cleanup - _ucp_handler.free_ucp_request(req); - - // remove from pending requests - it = requests.erase(it); - } else { - ++it; - } - // if any progress was made, reset the timeout start time - if (restart) { - start = time(NULL); - } - } - } - -#endif -} - -void cumlStdCommunicator_impl::allreduce(const void *sendbuff, void *recvbuff, - int count, datatype_t datatype, - op_t op, cudaStream_t stream) const { - NCCL_CHECK(ncclAllReduce(sendbuff, recvbuff, count, getNCCLDatatype(datatype), - getNCCLOp(op), _nccl_comm, stream)); -} - -void cumlStdCommunicator_impl::bcast(void *buff, int count, datatype_t datatype, - int root, cudaStream_t stream) const { - NCCL_CHECK(ncclBroadcast(buff, buff, count, getNCCLDatatype(datatype), root, - _nccl_comm, stream)); -} - -void cumlStdCommunicator_impl::reduce(const void *sendbuff, void *recvbuff, - int count, datatype_t datatype, op_t op, - int root, cudaStream_t stream) const { - NCCL_CHECK(ncclReduce(sendbuff, recvbuff, count, getNCCLDatatype(datatype), - getNCCLOp(op), root, _nccl_comm, stream)); -} - -void cumlStdCommunicator_impl::allgather(const void *sendbuff, void *recvbuff, - int sendcount, datatype_t datatype, - cudaStream_t stream) const { - NCCL_CHECK(ncclAllGather(sendbuff, recvbuff, sendcount, - getNCCLDatatype(datatype), _nccl_comm, stream)); -} - -void cumlStdCommunicator_impl::allgatherv(const void *sendbuf, void *recvbuf, - const int recvcounts[], - const int displs[], - datatype_t datatype, - cudaStream_t stream) const { - //From: "An Empirical Evaluation of Allgatherv on Multi-GPU Systems" - https://arxiv.org/pdf/1812.05964.pdf - //Listing 1 on page 4. 
- for (int root = 0; root < _size; ++root) - NCCL_CHECK(ncclBroadcast( - sendbuf, - static_cast(recvbuf) + displs[root] * getDatatypeSize(datatype), - recvcounts[root], getNCCLDatatype(datatype), root, _nccl_comm, stream)); -} - -void cumlStdCommunicator_impl::reducescatter(const void *sendbuff, - void *recvbuff, int recvcount, - datatype_t datatype, op_t op, - cudaStream_t stream) const { - NCCL_CHECK(ncclReduceScatter(sendbuff, recvbuff, recvcount, - getNCCLDatatype(datatype), getNCCLOp(op), - _nccl_comm, stream)); -} - -MLCommon::cumlCommunicator::status_t cumlStdCommunicator_impl::syncStream( - cudaStream_t stream) const { - cudaError_t cudaErr; - ncclResult_t ncclErr, ncclAsyncErr; - while (1) { - cudaErr = cudaStreamQuery(stream); - if (cudaErr == cudaSuccess) return status_t::commStatusSuccess; - - if (cudaErr != cudaErrorNotReady) { - // An error occurred querying the status of the stream - return status_t::commStatusError; - } - - ncclErr = ncclCommGetAsyncError(_nccl_comm, &ncclAsyncErr); - if (ncclErr != ncclSuccess) { - // An error occurred retrieving the asynchronous error - return status_t::commStatusError; - } - - if (ncclAsyncErr != ncclSuccess) { - // An asynchronous error happened. Stop the operation and destroy - // the communicator - ncclErr = ncclCommAbort(_nccl_comm); - if (ncclErr != ncclSuccess) - // Caller may abort with an exception or try to re-create a new communicator. - return status_t::commStatusAbort; - } - - // Let other threads (including NCCL threads) use the CPU. - pthread_yield(); - } -} - -} // end namespace ML diff --git a/cpp/comms/std/src/cuML_std_comms_impl.hpp b/cpp/comms/std/src/cuML_std_comms_impl.hpp deleted file mode 100644 index 9237fc83f9..0000000000 --- a/cpp/comms/std/src/cuML_std_comms_impl.hpp +++ /dev/null @@ -1,140 +0,0 @@ -/* - * Copyright (c) 2019-2020, NVIDIA CORPORATION. - * - * Licensed under the Apache License, Version 2.0 (the "License"); - * you may not use this file except in compliance with the License. - * You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. - */ - -#pragma once - -#include -#include -#include - -#include - -#include - -#ifdef WITH_UCX -#include -#include -#include "ucp_helper.h" -#endif - -namespace ML { - -/** - * @brief A cumlCommunicator implementation capable of running collective communications - * with NCCL and point-to-point-communications with UCX. Note that the latter is optional. - * - * Underlying comms, like NCCL and UCX, should be initialized and ready for use, - * and maintained, outside of the cuML Comms lifecycle. This allows us to decouple the - * ownership of the actual comms from cuml so that they can also be used outside of cuml. - * - * For instance, nccl-py can be used to bootstrap a ncclComm_t before it is - * used to construct a cuml comms instance. UCX endpoints can be bootstrapped - * in Python using ucx-py, before being used to construct a cuML comms instance. - */ -class cumlStdCommunicator_impl : public MLCommon::cumlCommunicator_iface { - public: - cumlStdCommunicator_impl() = delete; - -#ifdef WITH_UCX - - /** - * @brief Constructor for collective + point-to-point operation. 
- * @param comm initialized nccl comm - * @param ucp_worker initialized ucp_worker instance - * @param eps shared pointer to array of ucp endpoints - * @param size size of the cluster - * @param rank rank of the current worker - */ - cumlStdCommunicator_impl(ncclComm_t comm, ucp_worker_h ucp_worker, - std::shared_ptr eps, int size, int rank); -#endif - - /** - * @brief constructor for collective-only operation - * @param comm initilized nccl communicator - * @param size size of the cluster - * @param rank rank of the current worker - */ - cumlStdCommunicator_impl(ncclComm_t comm, int size, int rank); - - virtual ~cumlStdCommunicator_impl(); - - virtual int getSize() const; - - virtual int getRank() const; - - virtual std::unique_ptr commSplit( - int color, int key) const; - - virtual void barrier() const; - - virtual void isend(const void* buf, int size, int dest, int tag, - request_t* request) const; - - virtual void irecv(void* buf, int size, int source, int tag, - request_t* request) const; - - virtual void waitall(int count, request_t array_of_requests[]) const; - - virtual void allreduce(const void* sendbuff, void* recvbuff, int count, - datatype_t datatype, op_t op, - cudaStream_t stream) const; - - virtual void bcast(void* buff, int count, datatype_t datatype, int root, - cudaStream_t stream) const; - - virtual void reduce(const void* sendbuff, void* recvbuff, int count, - datatype_t datatype, op_t op, int root, - cudaStream_t stream) const; - - virtual void allgather(const void* sendbuff, void* recvbuff, int sendcount, - datatype_t datatype, cudaStream_t stream) const; - - virtual void allgatherv(const void* sendbuf, void* recvbuf, - const int recvcounts[], const int displs[], - datatype_t datatype, cudaStream_t stream) const; - - virtual void reducescatter(const void* sendbuff, void* recvbuff, - int recvcount, datatype_t datatype, op_t op, - cudaStream_t stream) const; - - virtual status_t syncStream(cudaStream_t stream) const; - - private: - ncclComm_t _nccl_comm; - cudaStream_t _stream; - - int *_sendbuff, *_recvbuff; - - int _size; - int _rank; - - void initialize(); - void get_request_id(request_t* req) const; - bool p2p_enabled = false; - -#ifdef WITH_UCX - comms_ucp_handler _ucp_handler; - ucp_worker_h _ucp_worker; - std::shared_ptr _ucp_eps; - mutable request_t _next_request_id; - mutable std::unordered_map - _requests_in_flight; - mutable std::unordered_set _free_requests; -#endif -}; - -} // end namespace ML diff --git a/cpp/comms/std/src/ucp_helper.h b/cpp/comms/std/src/ucp_helper.h deleted file mode 100644 index fbb8b3e110..0000000000 --- a/cpp/comms/std/src/ucp_helper.h +++ /dev/null @@ -1,240 +0,0 @@ -/* - * Copyright (c) 2019-2020, NVIDIA CORPORATION. - * - * Licensed under the Apache License, Version 2.0 (the "License"); - * you may not use this file except in compliance with the License. - * You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. 
- */ - -#include -#include -#include -#include -#include -#include -#include - -#pragma once - -typedef void (*dlsym_print_info)(ucp_ep_h, FILE *); -typedef void (*dlsym_rec_free)(void *); -typedef int (*dlsym_worker_progress)(ucp_worker_h); - -typedef ucs_status_ptr_t (*dlsym_send)(ucp_ep_h, const void *, size_t, - ucp_datatype_t, ucp_tag_t, - ucp_send_callback_t); -typedef ucs_status_ptr_t (*dlsym_recv)(ucp_worker_h, void *, size_t count, - ucp_datatype_t datatype, ucp_tag_t, - ucp_tag_t, ucp_tag_recv_callback_t); - -/** - * Standard UCX request object that will be passed - * around asynchronously. This object is really - * opaque and the comms layer only cares that it - * has been completed. Because cuml comms do not - * initialize the ucx application context, it doesn't - * own this object and thus it's important not to - * modify this struct. - */ -struct ucx_context { - int completed; -}; - -/** - * Wraps the `ucx_context` request and adds a few - * other fields for trace logging and cleanup. - */ -class ucp_request { - public: - struct ucx_context *req; - bool needs_release = true; - int other_rank = -1; - bool is_send_request = false; -}; - -// by default, match the whole tag -static const ucp_tag_t default_tag_mask = -1; - -// Only match the passed in tag, not the rank. This -// enables simulated multi-cast. -static const ucp_tag_t any_rank_tag_mask = 0xFFFF0000; - -// Per the MPI API, receiving from a rank of -1 denotes receiving -// from any rank that used the expected tag. -static const int UCP_ANY_RANK = -1; - -/** - * @brief Asynchronous send callback sets request to completed - */ -static void send_callback(void *request, ucs_status_t status) { - struct ucx_context *context = (struct ucx_context *)request; - context->completed = 1; -} - -/** - * @brief Asynchronous recv callback sets request to completed - */ -static void recv_callback(void *request, ucs_status_t status, - ucp_tag_recv_info_t *info) { - struct ucx_context *context = (struct ucx_context *)request; - context->completed = 1; -} - -/** - * Helper class for managing `dlopen` state and - * interacting with ucp. 
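The helper class below resolves every UCP entry point through dlopen/dlsym, so libucp.so stays a soft, runtime-optional dependency. A minimal standalone sketch of that pattern for a single symbol, with error handling that mirrors load_ucp_handle() and assert_dlerror():

```cpp
#include <dlfcn.h>

#include <cstdio>

typedef int (*worker_progress_fn)(void*);  // stand-in for ucp_worker_h

int main() {
  void* handle = dlopen("libucp.so", RTLD_LAZY | RTLD_NODELETE);
  if (!handle) {
    std::fprintf(stderr, "Cannot open UCX library: %s\n", dlerror());
    return 1;
  }
  dlerror();  // clear any stale error state before calling dlsym
  auto fn = (worker_progress_fn)dlsym(handle, "ucp_worker_progress");
  const char* error = dlerror();
  if (error != nullptr) {
    std::fprintf(stderr, "Error loading function symbol: %s\n", error);
    return 1;
  }
  (void)fn;  // would be invoked with a live ucp_worker_h
  dlclose(handle);
  return 0;
}
```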
- */ -class comms_ucp_handler { - public: - comms_ucp_handler() { - load_ucp_handle(); - load_send_func(); - load_recv_func(); - load_free_req_func(); - load_print_info_func(); - load_worker_progress_func(); - } - - ~comms_ucp_handler() { dlclose(ucp_handle); } - - private: - void *ucp_handle; - - dlsym_print_info print_info_func; - dlsym_rec_free req_free_func; - dlsym_worker_progress worker_progress_func; - dlsym_send send_func; - dlsym_recv recv_func; - - void load_ucp_handle() { - ucp_handle = dlopen("libucp.so", RTLD_LAZY | RTLD_NOLOAD | RTLD_NODELETE); - if (!ucp_handle) { - ucp_handle = dlopen("libucp.so", RTLD_LAZY | RTLD_NODELETE); - ASSERT(ucp_handle, "Cannot open UCX library: %s\n", dlerror()); - } - // Reset any potential error - dlerror(); - } - - void assert_dlerror() { - char *error = dlerror(); - ASSERT(error == NULL, "Error loading function symbol: %s\n", error); - } - - void load_send_func() { - send_func = (dlsym_send)dlsym(ucp_handle, "ucp_tag_send_nb"); - assert_dlerror(); - } - - void load_free_req_func() { - req_free_func = (dlsym_rec_free)dlsym(ucp_handle, "ucp_request_free"); - assert_dlerror(); - } - - void load_print_info_func() { - print_info_func = (dlsym_print_info)dlsym(ucp_handle, "ucp_ep_print_info"); - assert_dlerror(); - } - - void load_worker_progress_func() { - worker_progress_func = - (dlsym_worker_progress)dlsym(ucp_handle, "ucp_worker_progress"); - assert_dlerror(); - } - - void load_recv_func() { - recv_func = (dlsym_recv)dlsym(ucp_handle, "ucp_tag_recv_nb"); - assert_dlerror(); - } - - ucp_tag_t build_message_tag(int rank, int tag) const { - // keeping the rank in the lower bits enables debugging. - return ((uint32_t)tag << 31) | (uint32_t)rank; - } - - public: - int ucp_progress(ucp_worker_h worker) const { - return (*(worker_progress_func))(worker); - } - - /** - * @brief Frees any memory underlying the given ucp request object - */ - void free_ucp_request(ucp_request *request) const { - if (request->needs_release) { - request->req->completed = 0; - (*(req_free_func))(request->req); - } - free(request); - } - - /** - * @brief Asynchronously send data to the given endpoint using the given tag - */ - void ucp_isend(ucp_request *req, ucp_ep_h ep_ptr, const void *buf, int size, - int tag, ucp_tag_t tag_mask, int rank) const { - ucp_tag_t ucp_tag = build_message_tag(rank, tag); - - CUML_LOG_DEBUG("Sending tag: %ld", ucp_tag); - - ucs_status_ptr_t send_result = (*(send_func))( - ep_ptr, buf, size, ucp_dt_make_contig(1), ucp_tag, send_callback); - struct ucx_context *ucp_req = (struct ucx_context *)send_result; - if (UCS_PTR_IS_ERR(send_result)) { - ASSERT(!UCS_PTR_IS_ERR(send_result), - "unable to send UCX data message (%d)\n", - UCS_PTR_STATUS(send_result)); - /** - * If the request didn't fail, but it's not OK, it is in flight. - * Expect the handler to be invoked - */ - } else if (UCS_PTR_STATUS(send_result) != UCS_OK) { - /** - * If the request is OK, it's already been completed and we don't need to wait on it. - * The request will be a nullptr, however, so we need to create a new request - * and set it to completed to make the "waitall()" function work properly. - */ - req->needs_release = true; - } else { - req->needs_release = false; - } - - req->other_rank = rank; - req->is_send_request = true; - req->req = ucp_req; - } - - /** - * @brief Asynchronously receive data from given endpoint with the given tag. 
- */ - void ucp_irecv(ucp_request *req, ucp_worker_h worker, ucp_ep_h ep_ptr, - void *buf, int size, int tag, ucp_tag_t tag_mask, - int sender_rank) const { - ucp_tag_t ucp_tag = build_message_tag(sender_rank, tag); - - CUML_LOG_DEBUG("%d: Receiving tag: %ld", ucp_tag); - - ucs_status_ptr_t recv_result = - (*(recv_func))(worker, buf, size, ucp_dt_make_contig(1), ucp_tag, - tag_mask, recv_callback); - - struct ucx_context *ucp_req = (struct ucx_context *)recv_result; - - req->req = ucp_req; - req->needs_release = true; - req->is_send_request = false; - req->other_rank = sender_rank; - - ASSERT(!UCS_PTR_IS_ERR(recv_result), - "unable to receive UCX data message (%d)\n", - UCS_PTR_STATUS(recv_result)); - } -}; diff --git a/cpp/examples/dbscan/dbscan_example.cpp b/cpp/examples/dbscan/dbscan_example.cpp index db13720701..273d1fa71e 100644 --- a/cpp/examples/dbscan/dbscan_example.cpp +++ b/cpp/examples/dbscan/dbscan_example.cpp @@ -23,13 +23,7 @@ #include #include -#ifdef HAVE_CUB -#include -#endif //HAVE_CUB - -#ifdef HAVE_RMM -#include -#endif //HAVE_RMM +#include #include #include @@ -140,29 +134,12 @@ int main(int argc, char* argv[]) { } } - ML::cumlHandle cumlHandle; + raft::handle_t handle; -#ifdef HAVE_RMM - rmmOptions_t rmmOptions; - rmmOptions.allocation_mode = PoolAllocation; - rmmOptions.initial_pool_size = 0; - rmmOptions.enable_logging = false; - rmmError_t rmmStatus = rmmInitialize(&rmmOptions); - if (RMM_SUCCESS != rmmStatus) { - std::cerr << "WARN: Could not initialize RMM: " - << rmmGetErrorString(rmmStatus) << std::endl; - } -#endif //HAVE_RMM -#ifdef HAVE_RMM - std::shared_ptr allocator(new ML::rmmAllocatorAdapter()); -#elif defined(HAVE_CUB) std::shared_ptr allocator( - new ML::cachingDeviceAllocator()); -#else - std::shared_ptr allocator( - new ML::defaultDeviceAllocator()); -#endif // HAVE_RMM - cumlHandle.setDeviceAllocator(allocator); + new raft::mr::device::default_allocator()); + + handle.set_device_allocator(allocator); std::vector h_inputData; @@ -204,7 +181,7 @@ int main(int argc, char* argv[]) { cudaStream_t stream; CUDA_RT_CALL(cudaStreamCreate(&stream)); - cumlHandle.setStream(stream); + handle.set_stream(stream); std::vector h_labels(nRows); int* d_labels = nullptr; @@ -223,7 +200,7 @@ int main(int argc, char* argv[]) { << "eps - " << eps << std::endl << "max_bytes_per_batch - " << max_bytes_per_batch << std::endl; - ML::dbscanFit(cumlHandle, d_inputData, nRows, nCols, eps, minPts, d_labels, + ML::dbscanFit(handle, d_inputData, nRows, nCols, eps, minPts, d_labels, nullptr, max_bytes_per_batch, false); CUDA_RT_CALL(cudaMemcpyAsync(h_labels.data(), d_labels, nRows * sizeof(int), cudaMemcpyDeviceToHost, stream)); diff --git a/cpp/examples/kmeans/kmeans_example.cpp b/cpp/examples/kmeans/kmeans_example.cpp index 20bba55298..aeb03b2c67 100644 --- a/cpp/examples/kmeans/kmeans_example.cpp +++ b/cpp/examples/kmeans/kmeans_example.cpp @@ -23,13 +23,7 @@ #include -#ifdef HAVE_CUB -#include -#endif //HAVE_CUB - -#ifdef HAVE_RMM -#include -#endif // HAVE_RMM +#include #include #include @@ -92,17 +86,6 @@ int main(int argc, char *argv[]) { << "(" << cudaGetErrorString(cudaStatus) << ")" << std::endl; return 1; } -#ifdef HAVE_RMM - rmmOptions_t rmmOptions; - rmmOptions.allocation_mode = PoolAllocation; - rmmOptions.initial_pool_size = 0; - rmmOptions.enable_logging = false; - rmmError_t rmmStatus = rmmInitialize(&rmmOptions); - if (RMM_SUCCESS != rmmStatus) { - std::cerr << "WARN: Could not initialize RMM: " - << rmmGetErrorString(rmmStatus) << std::endl; - } -#endif // 
HAVE_RMM } std::vector h_srcdata; @@ -143,22 +126,16 @@ int main(int argc, char *argv[]) { std::cout << "Run KMeans with k=" << params.n_clusters << ", max_iterations=" << params.max_iter << std::endl; - ML::cumlHandle cumlHandle; -#ifdef HAVE_RMM - std::shared_ptr allocator( - new ML::rmmAllocatorAdapter()); -#elif defined(HAVE_CUB) - std::shared_ptr allocator( - new ML::cachingDeviceAllocator()); -#else + raft::handle_t handle; + std::shared_ptr allocator( - new ML::defaultDeviceAllocator()); -#endif // HAVE_RMM - cumlHandle.setDeviceAllocator(allocator); + new raft::mr::device::default_allocator()); + + handle.set_device_allocator(allocator); cudaStream_t stream; CUDA_RT_CALL(cudaStreamCreate(&stream)); - cumlHandle.setStream(stream); + handle.set_stream(stream); // srcdata size n_samples * n_features double *d_srcdata = nullptr; @@ -178,9 +155,8 @@ int main(int argc, char *argv[]) { double inertia = 0; int n_iter = 0; - ML::kmeans::fit_predict(cumlHandle, params, d_srcdata, n_samples, - n_features, 0, d_pred_centroids, d_pred_labels, - inertia, n_iter); + ML::kmeans::fit_predict(handle, params, d_srcdata, n_samples, n_features, 0, + d_pred_centroids, d_pred_labels, inertia, n_iter); std::vector h_pred_labels(n_samples); CUDA_RT_CALL(cudaMemcpyAsync(h_pred_labels.data(), d_pred_labels, diff --git a/cpp/include/cuml/cluster/dbscan.hpp b/cpp/include/cuml/cluster/dbscan.hpp index ecd717c0c8..e1a1dbe350 100644 --- a/cpp/include/cuml/cluster/dbscan.hpp +++ b/cpp/include/cuml/cluster/dbscan.hpp @@ -21,8 +21,6 @@ namespace ML { -/** @} */ - /** * @defgroup DbscanCpp C++ implementation of Dbscan algo * @brief Fits a DBSCAN model on an input feature matrix and outputs the labels @@ -45,20 +43,20 @@ namespace ML { * @{ */ -void dbscanFit(const cumlHandle &handle, float *input, int n_rows, int n_cols, - float eps, int min_pts, int *labels, +void dbscanFit(const raft::handle_t &handle, float *input, int n_rows, + int n_cols, float eps, int min_pts, int *labels, int *core_sample_indices = nullptr, size_t max_bytes_per_batch = 0, int verbosity = CUML_LEVEL_INFO); -void dbscanFit(const cumlHandle &handle, double *input, int n_rows, int n_cols, - double eps, int min_pts, int *labels, +void dbscanFit(const raft::handle_t &handle, double *input, int n_rows, + int n_cols, double eps, int min_pts, int *labels, int *core_sample_indices = nullptr, size_t max_bytes_per_batch = 0, int verbosity = CUML_LEVEL_INFO); -void dbscanFit(const cumlHandle &handle, float *input, int64_t n_rows, +void dbscanFit(const raft::handle_t &handle, float *input, int64_t n_rows, int64_t n_cols, float eps, int min_pts, int64_t *labels, int64_t *core_sample_indices = nullptr, size_t max_bytes_per_batch = 0, int verbosity = CUML_LEVEL_INFO); -void dbscanFit(const cumlHandle &handle, double *input, int64_t n_rows, +void dbscanFit(const raft::handle_t &handle, double *input, int64_t n_rows, int64_t n_cols, double eps, int min_pts, int64_t *labels, int64_t *core_sample_indices = nullptr, size_t max_bytes_per_batch = 0, int verbosity = CUML_LEVEL_INFO); diff --git a/cpp/include/cuml/cluster/kmeans.hpp b/cpp/include/cuml/cluster/kmeans.hpp index 882b67e6fa..7ac7c5e4ae 100644 --- a/cpp/include/cuml/cluster/kmeans.hpp +++ b/cpp/include/cuml/cluster/kmeans.hpp @@ -53,7 +53,7 @@ struct KMeansParams { int seed = 0; // Metric to use for distance computation. 
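The dbscan and kmeans examples above now share the same setup boilerplate in place of the old `cumlHandle` plus `#ifdef` allocator selection: construct a `raft::handle_t`, attach a device allocator, and bind an explicit CUDA stream. Condensed into one place as a sketch; the include paths and the `shared_ptr` element type are assumptions, since the diff rendering stripped the bracketed text:

```cpp
#include <cstdio>
#include <memory>
#include <cuda_runtime.h>
#include <raft/handle.hpp>               // assumed include path
#include <raft/mr/device/allocator.hpp>  // assumed include path

int main() {
  raft::handle_t handle;

  // Element type assumed to be raft::mr::device::allocator, the base of
  // default_allocator (template argument stripped in the diff above).
  std::shared_ptr<raft::mr::device::allocator> allocator(
    new raft::mr::device::default_allocator());
  handle.set_device_allocator(allocator);

  cudaStream_t stream;
  if (cudaStreamCreate(&stream) != cudaSuccess) {
    std::fprintf(stderr, "failed to create stream\n");
    return 1;
  }
  handle.set_stream(stream);

  // ... invoke cuML algorithms that take `handle` here ...

  cudaStreamDestroy(stream);
  return 0;
}
```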
Any metric from - // MLCommon::Distance::DistanceType can be used + // ML::Distance::DistanceType can be used int metric = 0; // Number of instance k-means algorithm will be run with different seeds. @@ -96,12 +96,12 @@ struct KMeansParams { closest cluster center. * @param[out] n_iter Number of iterations run. */ -void fit_predict(const ML::cumlHandle &handle, const KMeansParams ¶ms, +void fit_predict(const raft::handle_t &handle, const KMeansParams ¶ms, const float *X, int n_samples, int n_features, const float *sample_weight, float *centroids, int *labels, float &inertia, int &n_iter); -void fit_predict(const ML::cumlHandle &handle, const KMeansParams ¶ms, +void fit_predict(const raft::handle_t &handle, const KMeansParams ¶ms, const double *X, int n_samples, int n_features, const double *sample_weight, double *centroids, int *labels, double &inertia, int &n_iter); @@ -128,12 +128,12 @@ void fit_predict(const ML::cumlHandle &handle, const KMeansParams ¶ms, * @param[out] n_iter Number of iterations run. */ -void fit(const ML::cumlHandle &handle, const KMeansParams ¶ms, +void fit(const raft::handle_t &handle, const KMeansParams ¶ms, const float *X, int n_samples, int n_features, const float *sample_weight, float *centroids, float &inertia, int &n_iter); -void fit(const ML::cumlHandle &handle, const KMeansParams ¶ms, +void fit(const raft::handle_t &handle, const KMeansParams ¶ms, const double *X, int n_samples, int n_features, const double *sample_weight, double *centroids, double &inertia, int &n_iter); @@ -158,12 +158,12 @@ void fit(const ML::cumlHandle &handle, const KMeansParams ¶ms, * closest cluster center. */ -void predict(const ML::cumlHandle &handle, const KMeansParams ¶ms, +void predict(const raft::handle_t &handle, const KMeansParams ¶ms, const float *centroids, const float *X, int n_samples, int n_features, const float *sample_weight, int *labels, float &inertia); -void predict(const ML::cumlHandle &handle, const KMeansParams ¶ms, +void predict(const raft::handle_t &handle, const KMeansParams ¶ms, const double *centroids, const double *X, int n_samples, int n_features, const double *sample_weight, int *labels, double &inertia); @@ -184,14 +184,14 @@ void predict(const ML::cumlHandle &handle, const KMeansParams ¶ms, * sample in 'X' (it should be same as the dimension for each cluster centers in * 'centroids'). * @param[in] metric Metric to use for distance computation. Any - * metric from MLCommon::Distance::DistanceType can be used + * metric from ML::Distance::DistanceType can be used * @param[out] X_new X transformed in the new space.. */ -void transform(const ML::cumlHandle &handle, const KMeansParams ¶ms, +void transform(const raft::handle_t &handle, const KMeansParams ¶ms, const float *centroids, const float *X, int n_samples, int n_features, int metric, float *X_new); -void transform(const ML::cumlHandle &handle, const KMeansParams ¶ms, +void transform(const raft::handle_t &handle, const KMeansParams ¶ms, const double *centroids, const double *X, int n_samples, int n_features, int metric, double *X_new); diff --git a/cpp/include/cuml/cluster/kmeans_mg.hpp b/cpp/include/cuml/cluster/kmeans_mg.hpp index b10f5fe3f0..cba1fd3c72 100644 --- a/cpp/include/cuml/cluster/kmeans_mg.hpp +++ b/cpp/include/cuml/cluster/kmeans_mg.hpp @@ -43,11 +43,11 @@ namespace opg { * @param[out] n_iter Number of iterations run. 
*/ -void fit(const ML::cumlHandle &handle, const KMeansParams ¶ms, +void fit(const raft::handle_t &handle, const KMeansParams ¶ms, const float *X, int n_samples, int n_features, float *centroids, float &inertia, int &n_iter); -void fit(const ML::cumlHandle &handle, const KMeansParams ¶ms, +void fit(const raft::handle_t &handle, const KMeansParams ¶ms, const double *X, int n_samples, int n_features, double *centroids, double &inertia, int &n_iter); diff --git a/cpp/include/cuml/cluster/spectral.hpp b/cpp/include/cuml/cluster/spectral.hpp index 6a51e1773d..d984f217fd 100644 --- a/cpp/include/cuml/cluster/spectral.hpp +++ b/cpp/include/cuml/cluster/spectral.hpp @@ -35,8 +35,8 @@ namespace Spectral { * @param n_components the number of components to project the X into * @param out output array for embedding (size n*n_comonents) */ -void fit_embedding(const cumlHandle &handle, int *rows, int *cols, float *vals, - int nnz, int n, int n_components, float *out); +void fit_embedding(const raft::handle_t &handle, int *rows, int *cols, + float *vals, int nnz, int n, int n_components, float *out); } // namespace Spectral } // namespace ML diff --git a/cpp/include/cuml/common/callbackSink.hpp b/cpp/include/cuml/common/callbackSink.hpp new file mode 100644 index 0000000000..abd4c33a7e --- /dev/null +++ b/cpp/include/cuml/common/callbackSink.hpp @@ -0,0 +1,71 @@ +/* + * Copyright (c) 2020, NVIDIA CORPORATION. + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +#pragma once + +#include +#include + +#define SPDLOG_HEADER_ONLY +#include +#include +#include + +namespace spdlog { +namespace sinks { + +typedef void (*LogCallback)(int lvl, const char* msg); + +template +class CallbackSink : public base_sink { + public: + explicit CallbackSink(std::string tag = "spdlog", + LogCallback callback = nullptr, + void (*flush)() = nullptr) + : _callback{callback}, _flush{flush} {}; + + void set_callback(LogCallback callback) { _callback = callback; } + void set_flush(void (*flush)()) { _flush = flush; } + + protected: + void sink_it_(const details::log_msg& msg) override { + spdlog::memory_buf_t formatted; + base_sink::formatter_->format(msg, formatted); + std::string msg_string = fmt::to_string(formatted); + + if (_callback) { + _callback(static_cast(msg.level), msg_string.c_str()); + } else { + std::cout << msg_string; + } + } + + void flush_() override { + if (_flush) { + _flush(); + } else { + std::cout << std::flush; + } + } + + LogCallback _callback; + void (*_flush)(); +}; + +using callback_sink_mt = CallbackSink; +using callback_sink_st = CallbackSink; + +} // end namespace sinks +} // end namespace spdlog diff --git a/cpp/include/cuml/common/cuml_allocator.hpp b/cpp/include/cuml/common/cuml_allocator.hpp index fd68f1d367..215c1ad3f2 100644 --- a/cpp/include/cuml/common/cuml_allocator.hpp +++ b/cpp/include/cuml/common/cuml_allocator.hpp @@ -19,148 +19,12 @@ #include #include -namespace MLCommon { - -/** - * @brief Interface for a asynchronous device allocator. 
- * - * A implementation of this interface can make the following assumptions - * - It does not need to be but it can allow asynchronous allocate and deallocate. - * - Allocations may be always on the device that was specified on construction. - */ -class deviceAllocator { - public: - /** - * @brief Asynchronously allocates device memory. - * - * An implementation of this need to return a allocation of n bytes properly align bytes - * on the configured device. The allocation can optionally be asynchronous in the sense - * that it is only save to use after all work submitted to the passed in stream prior to - * the call to allocate has completed. If the allocation is used before, e.g. in another - * stream the behaviour may be undefined. - * @todo: Add alignment requirments. - * - * @param[in] n number of bytes to allocate - * @param[in] stream stream to issue the possible asynchronous allocation in - */ - virtual void* allocate(std::size_t n, cudaStream_t stream) = 0; - - /** - * @brief Asynchronously deallocates device memory - * - * An implementation of this need to ensure that the allocation that the passed in pointer - * points to remains usable until all work sheduled in stream prior to the call to - * deallocate has completed. - * - * @param[inout] p pointer to the buffer to deallocte - * @param[in] n size of the buffer to deallocte in bytes - * @param[in] stream stream in which the allocation might be still in use - */ - virtual void deallocate(void* p, std::size_t n, cudaStream_t stream) = 0; - - virtual ~deviceAllocator() {} -}; - -/** - * @brief Interface for a asynchronous host allocations. - * - * A implementation of this interface can make the following assumptions - * - It does not need to be but it can allow asynchronous allocate and deallocate. - * - Allocations don't need to be zero copy accessible form a device. - */ -class hostAllocator { - public: - /** - * @brief Asynchronously allocates host memory. - * - * An implementation of this need to return a allocation of n bytes properly align bytes - * on the host. The allocation can optionally be asynchronous in the sense - * that it is only save to use after all work submitted to the passed in stream prior to - * the call to allocate has completed. If the allocation is used before, e.g. in another - * stream the behaviour may be undefined. - * @todo: Add alignment requirments. - * - * @param[in] n number of bytes to allocate - * @param[in] stream stream to issue the possible asynchronous allocation in - */ - virtual void* allocate(std::size_t n, cudaStream_t stream) = 0; - - /** - * @brief Asynchronously deallocates host memory - * - * An implementation of this need to ensure that the allocation that the passed in pointer - * points to remains usable until all work sheduled in stream prior to the call to - * deallocate has completed. - * - * @param[inout] p pointer to the buffer to deallocte - * @param[in] n size of the buffer to deallocte in bytes - * @param[in] stream stream in which the allocation might be still in use - */ - virtual void deallocate(void* p, std::size_t n, cudaStream_t stream) = 0; - - virtual ~hostAllocator() {} -}; - -/** Default cudaMalloc/cudaFree based device allocator */ -class defaultDeviceAllocator : public deviceAllocator { - public: - /** - * @brief asynchronosly allocate n bytes that can be used after all work in - * stream sheduled prior to this call has completetd. 
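The contract being deleted here (and now provided by RAFT) is stream-ordered: a buffer returned by `allocate(n, stream)` is only guaranteed usable once all work previously enqueued on `stream` has completed, and `deallocate` must keep the buffer valid until prior work on `stream` finishes. A minimal sketch of a conforming implementation, written against a hypothetical stand-in for the interface; `cudaMalloc`/`cudaFree` synchronize device-wide, so they satisfy the contract trivially:

```cpp
#include <cstddef>
#include <cuda_runtime.h>

// Hypothetical stand-in for the deviceAllocator interface removed here.
class device_allocator {
 public:
  virtual void* allocate(std::size_t n, cudaStream_t stream) = 0;
  virtual void deallocate(void* p, std::size_t n, cudaStream_t stream) = 0;
  virtual ~device_allocator() {}
};

// cudaMalloc/cudaFree are device-synchronous, so this implementation
// trivially meets the stream-ordering requirements described above.
class default_device_allocator : public device_allocator {
 public:
  void* allocate(std::size_t n, cudaStream_t) override {
    void* p = nullptr;
    cudaMalloc(&p, n);
    return p;
  }
  void deallocate(void* p, std::size_t, cudaStream_t) override {
    cudaFree(p);
  }
};
```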
- * - * @param[in] n size of the allocation in bytes - * @param[in] stream the stream to use for the asynchronous allocations - */ - virtual void* allocate(std::size_t n, cudaStream_t stream) { - void* ptr = 0; - CUDA_CHECK(cudaMalloc(&ptr, n)); - return ptr; - } +#include +#include - /** - * @brief asynchronosly free an allocation of n bytes that can be reused after - * all work in stream scheduled prior to this call has completed. - * - * @param[in] p pointer to n bytes of memory to be deallocated - * @param[in] n size of the allocation to release in bytes - * @param[in] stream the stream to use for the asynchronous free - */ - virtual void deallocate(void* p, std::size_t n, cudaStream_t stream) { - CUDA_CHECK_NO_THROW(cudaFree(p)); - } - - virtual ~defaultDeviceAllocator() {} -}; - -/** Default cudaMallocHost/cudaFreeHost based host allocator */ -class defaultHostAllocator : public hostAllocator { - public: - /** - * @brief allocate n bytes that can be used after all work in - * stream sheduled prior to this call has completetd. - * - * @param[in] n size of the allocation in bytes - * @param[in] stream the stream to use for the asynchronous allocations - */ - virtual void* allocate(std::size_t n, cudaStream_t stream) { - void* ptr = 0; - CUDA_CHECK(cudaMallocHost(&ptr, n)); - return ptr; - } - - /** - * @brief free an allocation of n bytes that can be reused after - * all work in stream scheduled prior to this call has completed. - * - * @param[in] p pointer to n bytes of memory to be deallocated - * @param[in] n size of the allocation to release in bytes - * @param[in] stream the stream to use for the asynchronous free - */ - virtual void deallocate(void* p, std::size_t n, cudaStream_t stream) { - CUDA_CHECK_NO_THROW(cudaFreeHost(p)); - } +namespace MLCommon { - virtual ~defaultHostAllocator() {} -}; +using deviceAllocator = raft::mr::device::allocator; +using hostAllocator = raft::mr::host::allocator; }; // end namespace MLCommon diff --git a/cpp/include/cuml/common/logger.hpp b/cpp/include/cuml/common/logger.hpp index 0e9c2c285c..ac6e81ec81 100644 --- a/cpp/include/cuml/common/logger.hpp +++ b/cpp/include/cuml/common/logger.hpp @@ -17,12 +17,18 @@ #include #include +#include #include #include namespace spdlog { class logger; -}; +namespace sinks { +template +class CallbackSink; +using callback_sink_mt = CallbackSink; +}; // namespace sinks +}; // namespace spdlog namespace ML { @@ -104,6 +110,20 @@ class Logger { */ void setPattern(const std::string& pattern); + /** + * @brief Register a callback function to be run in place of usual log call + * + * @param[in] callback the function to be run on all logged messages + */ + void setCallback(void (*callback)(int lvl, const char* msg)); + + /** + * @brief Register a flush function compatible with the registered callback + * + * @param[in] flush the function to use when flushing logs + */ + void setFlush(void (*flush)()); + /** * @brief Tells whether messages will be logged for the given log level * @@ -133,10 +153,16 @@ class Logger { */ void log(int level, const char* fmt, ...); + /** + * @brief Flush logs by calling flush on underlying logger + */ + void flush(); + private: Logger(); ~Logger() {} + std::shared_ptr sink; std::shared_ptr logger; std::string currPattern; static const std::string DefaultPattern; diff --git a/cpp/include/cuml/common/rmmAllocatorAdapter.hpp b/cpp/include/cuml/common/rmmAllocatorAdapter.hpp deleted file mode 100644 index dcbac6ec43..0000000000 --- a/cpp/include/cuml/common/rmmAllocatorAdapter.hpp +++ /dev/null 
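Together with the `CallbackSink` added above, the new `setCallback`/`setFlush` declarations let a host application capture every formatted log line instead of having it written to stdout. A sketch of the intended registration, assuming the usual `ML::Logger::get()` singleton accessor, which is not shown in this hunk:

```cpp
#include <cuml/common/logger.hpp>
#include <cstdio>

// Receives every formatted log line together with its spdlog level.
static void my_log_callback(int lvl, const char* msg) {
  std::fprintf(stderr, "[cuml:%d] %s", lvl, msg);
}

static void my_flush() { std::fflush(stderr); }

void install_logging_hooks() {
  ML::Logger::get().setCallback(my_log_callback);  // get() is assumed here
  ML::Logger::get().setFlush(my_flush);
}
```

Per `sink_it_` above, passing a null callback falls back to writing the formatted message to stdout, so resetting the hook restores the default behavior.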
@@ -1,65 +0,0 @@ -/* - * Copyright (c) 2019-2020, NVIDIA CORPORATION. - * - * Licensed under the Apache License, Version 2.0 (the "License"); - * you may not use this file except in compliance with the License. - * You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. - */ - -#pragma once - -#include -#include -#include -#include - -namespace ML { - -/** - * @brief Implemententation of ML::deviceAllocator using the - * RAPIDS Memory Manager (RMM) for allocations. - * - * rmmAllocatorAdapter does not initialize RMM. If RMM is not initialized on - * construction of rmmAllocatorAdapter allocations fall back to cudaMalloc. - */ -class rmmAllocatorAdapter : public ML::deviceAllocator { - public: - rmmAllocatorAdapter() {} - - /** - * @brief asynchronosly allocate n bytes that can be used after all work in - * stream sheduled prior to this call has completetd. - * - * @param[in] n size of the allocation in bytes - * @param[in] stream the stream to use for the asynchronous allocations - */ - virtual void* allocate(std::size_t n, cudaStream_t stream) { - void* ptr = 0; - ptr = rmm::mr::get_default_resource()->allocate(n, stream); - return ptr; - } - - /** - * @brief asynchronosly free an allocation of n bytes that can be reused after - * all work in stream scheduled prior to this call has completed. - * - * @param[in] p pointer to n bytes of memory to be deallocated - * @param[in] n size of the allocation to release in bytes - * @param[in] stream the stream to use for the asynchronous free - */ - virtual void deallocate(void* p, std::size_t n, cudaStream_t stream) { - rmm::mr::get_default_resource()->deallocate(p, n, stream); - } - - virtual ~rmmAllocatorAdapter() {} -}; - -} // end namespace ML diff --git a/cpp/include/cuml/common/rmmPoolAllocatorAdapter.hpp b/cpp/include/cuml/common/rmmPoolAllocatorAdapter.hpp deleted file mode 100644 index 7282c67f9a..0000000000 --- a/cpp/include/cuml/common/rmmPoolAllocatorAdapter.hpp +++ /dev/null @@ -1,49 +0,0 @@ -/* - * Copyright (c) 2019-2020, NVIDIA CORPORATION. - * - * Licensed under the Apache License, Version 2.0 (the "License"); - * you may not use this file except in compliance with the License. - * You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. - */ - -#pragma once - -#include -#include -#include -#include -#include "rmmAllocatorAdapter.hpp" - -namespace ML { - -/** - * @brief Implemententation of ML::deviceAllocator using the RMM pool - * - * @todo rmmPoolAllocatorAdapter currently only uses the default ctor of the - * underlying pool allocator (ie cnmem). 
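The pool adapter deleted here follows a small RAII pattern worth naming: install a memory resource as the process-wide default in the constructor, remember the previous one, and restore it in the destructor so the swap stays scoped. The same pattern, sketched with a hypothetical resource registry standing in for `rmm::mr::set_default_resource`:

```cpp
struct memory_resource {
  virtual ~memory_resource() = default;
};

// Hypothetical global "default resource" slot, standing in for
// rmm::mr::get_default_resource / set_default_resource.
static memory_resource* g_default = nullptr;

static memory_resource* set_default(memory_resource* mr) {
  memory_resource* prev = g_default;
  g_default = mr;
  return prev;
}

// RAII guard: install a resource on construction, restore the previous one
// on destruction -- the pattern rmmPoolAllocatorAdapter used for its pool.
class scoped_default_resource {
 public:
  explicit scoped_default_resource(memory_resource* mr)
    : prev_(set_default(mr)) {}
  ~scoped_default_resource() { set_default(prev_); }

 private:
  memory_resource* prev_;
};
```

Because the restore happens in a destructor, the previous resource comes back even if the scope is left via an exception.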
- */ -class rmmPoolAllocatorAdapter : public rmmAllocatorAdapter { - public: - rmmPoolAllocatorAdapter() : cnmem_mr_() { - prev_mr_ = rmm::mr::set_default_resource(&cnmem_mr_); - } - - ~rmmPoolAllocatorAdapter() { - // restore the previous memory resource when this object goes out-of-scope - rmm::mr::set_default_resource(prev_mr_); - } - - private: - rmm::mr::cnmem_memory_resource cnmem_mr_; - rmm::mr::device_memory_resource* prev_mr_; -}; - -} // end namespace ML diff --git a/cpp/include/cuml/common/utils.hpp b/cpp/include/cuml/common/utils.hpp index 7962c6bb3d..3fcd535825 100644 --- a/cpp/include/cuml/common/utils.hpp +++ b/cpp/include/cuml/common/utils.hpp @@ -18,101 +18,10 @@ #include #include +#include #include +#include #include #include #include #include "logger.hpp" - -namespace MLCommon { -/** base exception class for the cuML or ml-prims project */ -class Exception : public std::exception { - public: - /** default ctor */ - Exception() throw() : std::exception(), msg() {} - - /** copy ctor */ - Exception(const Exception& src) throw() : std::exception(), msg(src.what()) { - collectCallStack(); - } - - /** ctor from an input message */ - Exception(const std::string& _msg) throw() : std::exception(), msg(_msg) { - collectCallStack(); - } - - /** dtor */ - virtual ~Exception() throw() {} - - /** get the message associated with this exception */ - virtual const char* what() const throw() { return msg.c_str(); } - - private: - /** message associated with this exception */ - std::string msg; - - /** append call stack info to this exception's message for ease of debug */ - // Courtesy: https://www.gnu.org/software/libc/manual/html_node/Backtraces.html - void collectCallStack() throw() { -#ifdef __GNUC__ - const int MaxStackDepth = 64; - void* stack[MaxStackDepth]; - auto depth = backtrace(stack, MaxStackDepth); - std::ostringstream oss; - oss << std::endl << "Obtained " << depth << " stack frames" << std::endl; - char** strings = backtrace_symbols(stack, depth); - if (strings == nullptr) { - oss << "But no stack trace could be found!" << std::endl; - msg += oss.str(); - return; - } - ///@todo: support for demangling of C++ symbol names - for (int i = 0; i < depth; ++i) { - oss << "#" << i << " in " << strings[i] << std::endl; - } - free(strings); - msg += oss.str(); -#endif // __GNUC__ - } -}; - -/** macro to throw a runtime error */ -#define THROW(fmt, ...) \ - do { \ - std::string msg; \ - char errMsg[2048]; \ - std::snprintf(errMsg, sizeof(errMsg), \ - "Exception occured! file=%s line=%d: ", __FILE__, __LINE__); \ - msg += errMsg; \ - std::snprintf(errMsg, sizeof(errMsg), fmt, ##__VA_ARGS__); \ - msg += errMsg; \ - throw MLCommon::Exception(msg); \ - } while (0) - -/** macro to check for a conditional and assert on failure */ -#define ASSERT(check, fmt, ...) \ - do { \ - if (!(check)) THROW(fmt, ##__VA_ARGS__); \ - } while (0) - -/** check for cuda runtime API errors and assert accordingly */ -#define CUDA_CHECK(call) \ - do { \ - cudaError_t status = call; \ - ASSERT(status == cudaSuccess, "FAIL: call='%s'. Reason:%s", #call, \ - cudaGetErrorString(status)); \ - } while (0) - -/** - * @brief check for cuda runtime API errors but log error instead of raising - * exception. 
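These macros (now supplied by RAFT) all reduce to one pattern: evaluate the CUDA call exactly once inside a `do { ... } while (0)` wrapper, compare against `cudaSuccess`, and report with `#call`, `__FILE__`, and `__LINE__` context; the `_NO_THROW` variant logs instead of throwing so it stays safe in destructors. A minimal, self-contained version of the throwing flavor (aborting rather than throwing, to stay dependency-free):

```cpp
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Minimal version of the removed CUDA_CHECK: evaluate the call once, then
// fail loudly with call/file/line context on any non-success status.
#define CUDA_CHECK_MIN(call)                                        \
  do {                                                              \
    cudaError_t status = (call);                                    \
    if (status != cudaSuccess) {                                    \
      std::fprintf(stderr, "CUDA call='%s' failed at %s:%d: %s\n",  \
                   #call, __FILE__, __LINE__,                       \
                   cudaGetErrorString(status));                     \
      std::abort();                                                 \
    }                                                               \
  } while (0)

int main() {
  void* p = nullptr;
  CUDA_CHECK_MIN(cudaMalloc(&p, 256));
  CUDA_CHECK_MIN(cudaFree(p));
  return 0;
}
```

The `do { ... } while (0)` wrapper makes the macro behave like a single statement, so it composes correctly with unbraced `if`/`else`.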
- */ -#define CUDA_CHECK_NO_THROW(call) \ - do { \ - cudaError_t status = call; \ - if (status != cudaSuccess) { \ - CUML_LOG_ERROR("CUDA call='%s' at file=%s line=%d failed with %s ", \ - #call, __FILE__, __LINE__, cudaGetErrorString(status)); \ - } \ - } while (0) -}; // namespace MLCommon diff --git a/cpp/include/cuml/cuml.hpp b/cpp/include/cuml/cuml.hpp index cc7a5cbca2..27e2c06f54 100644 --- a/cpp/include/cuml/cuml.hpp +++ b/cpp/include/cuml/cuml.hpp @@ -19,105 +19,10 @@ #include #include #include +#include #include namespace ML { - -class cumlHandle_impl; - using MLCommon::deviceAllocator; using MLCommon::hostAllocator; - -using MLCommon::defaultDeviceAllocator; -using MLCommon::defaultHostAllocator; - -/** - * @brief Handle to manage resources needed by cuML algorithms. - */ -class cumlHandle { - public: - /** - * @brief construct a cumlHandle with default paramters. - * @param n_streams number of internal streams to be setup - * - * The default paramters are - * - stream: default or NULL stream - * - DeviceAllocator: cudaMalloc - * - HostAllocator: cudaMallocHost - * @{ - */ - cumlHandle(int n_streams); - cumlHandle(); - /** @} */ - /** - * @brief releases all resources internally manged by cumlHandle. - */ - ~cumlHandle(); - /** - * @brief sets the stream to which all cuML work issued via this handle should be ordered. - * - * @param[in] stream the stream to which cuML work should be ordered. - */ - void setStream(cudaStream_t stream); - /** - * @brief gets the stream to which all cuML work issued via this handle should be ordered. - * - * @returns the stream to which cuML work should be ordered. - */ - cudaStream_t getStream() const; - /** Get the cached device properties of the device this handle is for */ - const cudaDeviceProp& getDeviceProperties() const; - /** - * @brief sets the allocator to use for all device allocations done in cuML. - * - * @param[in] allocator the deviceAllocator to use for device allocations. - */ - void setDeviceAllocator(std::shared_ptr allocator); - /** - * @brief gets the allocator to use for all device allocations done in cuML. - * - * @returns the deviceAllocator to use for device allocations. - */ - std::shared_ptr getDeviceAllocator() const; - /** - * @brief sets the allocator to use for substantial host allocations done in cuML. - * - * @param[in] allocator the hostAllocator to use for host allocations. - */ - void setHostAllocator(std::shared_ptr allocator); - /** - * @brief gets the allocator to use for substantial host allocations done in cuML. - * - * @returns the hostAllocator to use for host allocations. - */ - std::shared_ptr getHostAllocator() const; - /** - * @brief API to query Num of work streams set during handle creation. - * @returns num of streams in the handle. - */ - int getNumInternalStreams(); - - /** - * @brief API to get the internal streams as a vector. - * @return vector of internal streams in the handle - */ - std::vector getInternalStreams() const; - - /** - * @brief for internal use only. - */ - const cumlHandle_impl& getImpl() const; - /** - * @brief for internal use only. 
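Everything `cumlHandle` managed (stream, device/host allocators, cached device properties, the internal stream pool) now lives directly on `raft::handle_t`, which is why most of this diff is a mechanical rename. A rough before/after correspondence as a sketch; `set_stream` and `set_device_allocator` appear elsewhere in this diff, while the getter names and the `n_streams` constructor are assumptions about the RAFT API of this era:

```cpp
#include <cuda_runtime.h>
#include <raft/handle.hpp>  // assumed include path

void handle_migration_sketch() {
  // cumlHandle h(4);          -> handle with 4 internal streams (assumed ctor)
  raft::handle_t h(4);

  cudaStream_t s;
  cudaStreamCreate(&s);

  h.set_stream(s);                    // was: h.setStream(s);
  cudaStream_t cur = h.get_stream();  // was: h.getStream();  (assumed name)
  (void)cur;

  // h.setDeviceAllocator(a);  -> h.set_device_allocator(a);
  // h.getDeviceAllocator();   -> h.get_device_allocator();   (assumed name)

  cudaStreamDestroy(s);
}
```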
- */ - cumlHandle_impl& getImpl(); - - /** for internal use only */ - static int getDefaultNumInternalStreams(); - - private: - static constexpr int _default_num_internal_streams = 0; - std::unique_ptr _impl; -}; - -} // end namespace ML +} // namespace ML diff --git a/cpp/include/cuml/datasets/make_arima.hpp b/cpp/include/cuml/datasets/make_arima.hpp index f8da8c24b1..ab1ed41a24 100644 --- a/cpp/include/cuml/datasets/make_arima.hpp +++ b/cpp/include/cuml/datasets/make_arima.hpp @@ -37,11 +37,12 @@ namespace Datasets { * @param[in] seed Seed for the random number generator * @{ */ -void make_arima(const cumlHandle& handle, float* out, int batch_size, int n_obs, - ARIMAOrder order, float scale = 1.0f, float noise_scale = 0.2f, - float intercept_scale = 1.0f, uint64_t seed = 0ULL); +void make_arima(const raft::handle_t& handle, float* out, int batch_size, + int n_obs, ARIMAOrder order, float scale = 1.0f, + float noise_scale = 0.2f, float intercept_scale = 1.0f, + uint64_t seed = 0ULL); -void make_arima(const cumlHandle& handle, double* out, int batch_size, +void make_arima(const raft::handle_t& handle, double* out, int batch_size, int n_obs, ARIMAOrder order, double scale = 1.0, double noise_scale = 0.2, double intercept_scale = 1.0, uint64_t seed = 0ULL); diff --git a/cpp/include/cuml/datasets/make_blobs.hpp b/cpp/include/cuml/datasets/make_blobs.hpp index 2f9c6171b6..baacae16fa 100644 --- a/cpp/include/cuml/datasets/make_blobs.hpp +++ b/cpp/include/cuml/datasets/make_blobs.hpp @@ -53,29 +53,29 @@ namespace Datasets { * @param[in] seed seed for the RNG * @{ */ -void make_blobs(const cumlHandle& handle, float* out, int64_t* labels, +void make_blobs(const raft::handle_t& handle, float* out, int64_t* labels, int64_t n_rows, int64_t n_cols, int64_t n_clusters, bool row_major = true, const float* centers = nullptr, const float* cluster_std = nullptr, const float cluster_std_scalar = 1.f, bool shuffle = true, float center_box_min = -10.f, float center_box_max = 10.f, uint64_t seed = 0ULL); -void make_blobs(const cumlHandle& handle, double* out, int64_t* labels, +void make_blobs(const raft::handle_t& handle, double* out, int64_t* labels, int64_t n_rows, int64_t n_cols, int64_t n_clusters, bool row_major = true, const double* centers = nullptr, const double* cluster_std = nullptr, const double cluster_std_scalar = 1.0, bool shuffle = true, double center_box_min = -10.0, double center_box_max = 10.0, uint64_t seed = 0ULL); -void make_blobs(const cumlHandle& handle, float* out, int* labels, int n_rows, - int n_cols, int n_clusters, bool row_major = true, +void make_blobs(const raft::handle_t& handle, float* out, int* labels, + int n_rows, int n_cols, int n_clusters, bool row_major = true, const float* centers = nullptr, const float* cluster_std = nullptr, const float cluster_std_scalar = 1.f, bool shuffle = true, float center_box_min = -10.f, float center_box_max = 10.0, uint64_t seed = 0ULL); -void make_blobs(const cumlHandle& handle, double* out, int* labels, int n_rows, - int n_cols, int n_clusters, bool row_major = true, +void make_blobs(const raft::handle_t& handle, double* out, int* labels, + int n_rows, int n_cols, int n_clusters, bool row_major = true, const double* centers = nullptr, const double* cluster_std = nullptr, const double cluster_std_scalar = 1.0, bool shuffle = true, diff --git a/cpp/include/cuml/datasets/make_regression.hpp b/cpp/include/cuml/datasets/make_regression.hpp index f163cfac21..c6aa8c5f8f 100644 --- a/cpp/include/cuml/datasets/make_regression.hpp +++ 
b/cpp/include/cuml/datasets/make_regression.hpp @@ -51,28 +51,28 @@ namespace Datasets { * @param[in] shuffle Shuffle the samples and the features * @param[in] seed Seed for the random number generator */ -void make_regression(const cumlHandle& handle, float* out, float* values, +void make_regression(const raft::handle_t& handle, float* out, float* values, int64_t n_rows, int64_t n_cols, int64_t n_informative, float* coef = nullptr, int64_t n_targets = 1LL, float bias = 0.0f, int64_t effective_rank = -1LL, float tail_strength = 0.5f, float noise = 0.0f, bool shuffle = true, uint64_t seed = 0ULL); -void make_regression(const cumlHandle& handle, double* out, double* values, +void make_regression(const raft::handle_t& handle, double* out, double* values, int64_t n_rows, int64_t n_cols, int64_t n_informative, double* coef = nullptr, int64_t n_targets = 1LL, double bias = 0.0, int64_t effective_rank = -1LL, double tail_strength = 0.5, double noise = 0.0, bool shuffle = true, uint64_t seed = 0ULL); -void make_regression(const cumlHandle& handle, float* out, float* values, +void make_regression(const raft::handle_t& handle, float* out, float* values, int n_rows, int n_cols, int n_informative, float* coef = nullptr, int n_targets = 1LL, float bias = 0.0f, int effective_rank = -1LL, float tail_strength = 0.5f, float noise = 0.0f, bool shuffle = true, uint64_t seed = 0ULL); -void make_regression(const cumlHandle& handle, double* out, double* values, +void make_regression(const raft::handle_t& handle, double* out, double* values, int n_rows, int n_cols, int n_informative, double* coef = nullptr, int n_targets = 1LL, double bias = 0.0, int effective_rank = -1LL, diff --git a/cpp/include/cuml/decomposition/params.hpp b/cpp/include/cuml/decomposition/params.hpp index dabc904156..014d52735d 100644 --- a/cpp/include/cuml/decomposition/params.hpp +++ b/cpp/include/cuml/decomposition/params.hpp @@ -19,19 +19,12 @@ namespace ML { /** - * @defgroup pcaSolver: enumeration for pca solvers. - * @param AUTO: Fastest solver will be used based on input shape and n_components. - * @param FULL: All the eigenvectors and singular values (or eigenvalues) will be generated. - * @param ARPACK: tsvd using power method. Lanczos will be included in the future. - * @param RANDOMIZED: randomized svd * @param COV_EIG_DQ: covariance of input will be used along with eigen decomposition using divide and conquer method for symmetric matrices * @param COV_EIG_JACOBI: covariance of input will be used along with eigen decomposition using jacobi method for symmetric matrices - * @{ */ enum class solver : int { COV_EIG_DQ, COV_EIG_JACOBI, - RANDOMIZED, }; class params { @@ -48,7 +41,6 @@ class paramsSolver : public params { //math_t tol = 0.0; float tol = 0.0; int n_iterations = 15; - int random_state; int verbose = 0; }; @@ -56,9 +48,7 @@ template class paramsTSVDTemplate : public paramsSolver { public: int n_components = 1; - int max_sweeps = 15; enum_solver algorithm = enum_solver::COV_EIG_DQ; - bool trans_input = false; }; /** @@ -68,19 +58,16 @@ class paramsTSVDTemplate : public paramsSolver { * use fit_transform(X) instead. * @param whiten: When True (False by default) the components_ vectors are multiplied by the square root of n_samples and * then divided by the singular values to ensure uncorrelated outputs with unit component-wise variances. - * @param svd_solver: the solver to be used in PCA. + * @param algorithm: the solver to be used in PCA. 
* @param tol: Tolerance for singular values computed by svd_solver == ‘arpack’ or svd_solver == ‘COV_EIG_JACOBI’ - * @param iterated_power: Number of iterations for the power method computed by svd_solver == ‘randomized’ or - * jacobi method by svd_solver == 'COV_EIG_JACOBI'. - * @param random_state: RandomState instance or None, optional (default None) + * @param n_iterations: Number of iterations for the power method computed by jacobi method (svd_solver == 'COV_EIG_JACOBI'). * @param verbose: 0: no error message printing, 1: print error messages - * @param max_sweeps: number of sweeps jacobi method uses. The more the better accuracy. */ template class paramsPCATemplate : public paramsTSVDTemplate { public: - bool copy = true; + bool copy = true; // TODO unused, see #2830 and #2833 bool whiten = false; }; diff --git a/cpp/include/cuml/decomposition/pca.hpp b/cpp/include/cuml/decomposition/pca.hpp index 2bb577fd75..8710b03a2f 100644 --- a/cpp/include/cuml/decomposition/pca.hpp +++ b/cpp/include/cuml/decomposition/pca.hpp @@ -21,32 +21,32 @@ namespace ML { -void pcaFit(cumlHandle &handle, float *input, float *components, +void pcaFit(raft::handle_t &handle, float *input, float *components, float *explained_var, float *explained_var_ratio, float *singular_vals, float *mu, float *noise_vars, const paramsPCA &prms); -void pcaFit(cumlHandle &handle, double *input, double *components, +void pcaFit(raft::handle_t &handle, double *input, double *components, double *explained_var, double *explained_var_ratio, double *singular_vals, double *mu, double *noise_vars, const paramsPCA &prms); -void pcaFitTransform(cumlHandle &handle, float *input, float *trans_input, +void pcaFitTransform(raft::handle_t &handle, float *input, float *trans_input, float *components, float *explained_var, float *explained_var_ratio, float *singular_vals, float *mu, float *noise_vars, const paramsPCA &prms); -void pcaFitTransform(cumlHandle &handle, double *input, double *trans_input, +void pcaFitTransform(raft::handle_t &handle, double *input, double *trans_input, double *components, double *explained_var, double *explained_var_ratio, double *singular_vals, double *mu, double *noise_vars, const paramsPCA &prms); -void pcaInverseTransform(cumlHandle &handle, float *trans_input, +void pcaInverseTransform(raft::handle_t &handle, float *trans_input, float *components, float *singular_vals, float *mu, float *input, const paramsPCA &prms); -void pcaInverseTransform(cumlHandle &handle, double *trans_input, +void pcaInverseTransform(raft::handle_t &handle, double *trans_input, double *components, double *singular_vals, double *mu, double *input, const paramsPCA &prms); -void pcaTransform(cumlHandle &handle, float *input, float *components, +void pcaTransform(raft::handle_t &handle, float *input, float *components, float *trans_input, float *singular_vals, float *mu, const paramsPCA &prms); -void pcaTransform(cumlHandle &handle, double *input, double *components, +void pcaTransform(raft::handle_t &handle, double *input, double *components, double *trans_input, double *singular_vals, double *mu, const paramsPCA &prms); diff --git a/cpp/include/cuml/decomposition/pca_mg.hpp b/cpp/include/cuml/decomposition/pca_mg.hpp index 5b3f83a18b..302aaf4fd1 100644 --- a/cpp/include/cuml/decomposition/pca_mg.hpp +++ b/cpp/include/cuml/decomposition/pca_mg.hpp @@ -47,13 +47,13 @@ namespace opg { * @param[in] prms: data structure that includes all the parameters from input size to algorithm * @param[in] verbose */ -void fit(cumlHandle &handle, 
+void fit(raft::handle_t &handle, std::vector *> &input_data, MLCommon::Matrix::PartDescriptor &input_desc, float *components, float *explained_var, float *explained_var_ratio, float *singular_vals, float *mu, float *noise_vars, paramsPCAMG prms, bool verbose = false); -void fit(cumlHandle &handle, +void fit(raft::handle_t &handle, std::vector *> &input_data, MLCommon::Matrix::PartDescriptor &input_desc, double *components, double *explained_var, double *explained_var_ratio, @@ -76,7 +76,7 @@ void fit(cumlHandle &handle, * @param[in] prms: data structure that includes all the parameters from input size to algorithm * @param[in] verbose */ -void fit_transform(cumlHandle &handle, +void fit_transform(raft::handle_t &handle, MLCommon::Matrix::RankSizePair **rank_sizes, size_t n_parts, MLCommon::Matrix::floatData_t **input, MLCommon::Matrix::floatData_t **trans_input, @@ -84,7 +84,7 @@ void fit_transform(cumlHandle &handle, float *explained_var_ratio, float *singular_vals, float *mu, float *noise_vars, paramsPCAMG prms, bool verbose); -void fit_transform(cumlHandle &handle, +void fit_transform(raft::handle_t &handle, MLCommon::Matrix::RankSizePair **rank_sizes, size_t n_parts, MLCommon::Matrix::doubleData_t **input, MLCommon::Matrix::doubleData_t **trans_input, @@ -106,14 +106,16 @@ void fit_transform(cumlHandle &handle, * @param[in] prms: data structure that includes all the parameters from input size to algorithm * @param[in] verbose */ -void transform(cumlHandle &handle, MLCommon::Matrix::RankSizePair **rank_sizes, - size_t n_parts, MLCommon::Matrix::Data **input, - float *components, MLCommon::Matrix::Data **trans_input, +void transform(raft::handle_t &handle, + MLCommon::Matrix::RankSizePair **rank_sizes, size_t n_parts, + MLCommon::Matrix::Data **input, float *components, + MLCommon::Matrix::Data **trans_input, float *singular_vals, float *mu, paramsPCAMG prms, bool verbose); -void transform(cumlHandle &handle, MLCommon::Matrix::RankSizePair **rank_sizes, - size_t n_parts, MLCommon::Matrix::Data **input, - double *components, MLCommon::Matrix::Data **trans_input, +void transform(raft::handle_t &handle, + MLCommon::Matrix::RankSizePair **rank_sizes, size_t n_parts, + MLCommon::Matrix::Data **input, double *components, + MLCommon::Matrix::Data **trans_input, double *singular_vals, double *mu, paramsPCAMG prms, bool verbose); @@ -130,7 +132,7 @@ void transform(cumlHandle &handle, MLCommon::Matrix::RankSizePair **rank_sizes, * @param[in] prms: data structure that includes all the parameters from input size to algorithm * @param[in] verbose */ -void inverse_transform(cumlHandle &handle, +void inverse_transform(raft::handle_t &handle, MLCommon::Matrix::RankSizePair **rank_sizes, size_t n_parts, MLCommon::Matrix::Data **trans_input, @@ -139,7 +141,7 @@ void inverse_transform(cumlHandle &handle, bool verbose); void inverse_transform( - cumlHandle &handle, MLCommon::Matrix::RankSizePair **rank_sizes, + raft::handle_t &handle, MLCommon::Matrix::RankSizePair **rank_sizes, size_t n_parts, MLCommon::Matrix::Data **trans_input, double *components, MLCommon::Matrix::Data **input, double *singular_vals, double *mu, paramsPCAMG prms, bool verbose); diff --git a/cpp/include/cuml/decomposition/sign_flip_mg.hpp b/cpp/include/cuml/decomposition/sign_flip_mg.hpp index 8930d03d79..2563a740cb 100644 --- a/cpp/include/cuml/decomposition/sign_flip_mg.hpp +++ b/cpp/include/cuml/decomposition/sign_flip_mg.hpp @@ -35,12 +35,12 @@ namespace opg { * @param[in] n_stream: number of streams * @{ */ -void 
sign_flip(cumlHandle &handle, +void sign_flip(raft::handle_t &handle, std::vector *> &input_data, MLCommon::Matrix::PartDescriptor &input_desc, float *components, int n_components, cudaStream_t *streams, int n_stream); -void sign_flip(cumlHandle &handle, +void sign_flip(raft::handle_t &handle, std::vector *> &input_data, MLCommon::Matrix::PartDescriptor &input_desc, double *components, int n_components, cudaStream_t *streams, int n_stream); diff --git a/cpp/include/cuml/decomposition/tsvd.hpp b/cpp/include/cuml/decomposition/tsvd.hpp index 8296f0dba0..66e76bac1d 100644 --- a/cpp/include/cuml/decomposition/tsvd.hpp +++ b/cpp/include/cuml/decomposition/tsvd.hpp @@ -21,27 +21,27 @@ namespace ML { -void tsvdFit(cumlHandle &handle, float *input, float *components, +void tsvdFit(raft::handle_t &handle, float *input, float *components, float *singular_vals, const paramsTSVD &prms); -void tsvdFit(cumlHandle &handle, double *input, double *components, +void tsvdFit(raft::handle_t &handle, double *input, double *components, double *singular_vals, const paramsTSVD &prms); -void tsvdInverseTransform(cumlHandle &handle, float *trans_input, +void tsvdInverseTransform(raft::handle_t &handle, float *trans_input, float *components, float *input, const paramsTSVD &prms); -void tsvdInverseTransform(cumlHandle &handle, double *trans_input, +void tsvdInverseTransform(raft::handle_t &handle, double *trans_input, double *components, double *input, const paramsTSVD &prms); -void tsvdTransform(cumlHandle &handle, float *input, float *components, +void tsvdTransform(raft::handle_t &handle, float *input, float *components, float *trans_input, const paramsTSVD &prms); -void tsvdTransform(cumlHandle &handle, double *input, double *components, +void tsvdTransform(raft::handle_t &handle, double *input, double *components, double *trans_input, const paramsTSVD &prms); -void tsvdFitTransform(cumlHandle &handle, float *input, float *trans_input, +void tsvdFitTransform(raft::handle_t &handle, float *input, float *trans_input, float *components, float *explained_var, float *explained_var_ratio, float *singular_vals, const paramsTSVD &prms); -void tsvdFitTransform(cumlHandle &handle, double *input, double *trans_input, - double *components, double *explained_var, - double *explained_var_ratio, double *singular_vals, - const paramsTSVD &prms); +void tsvdFitTransform(raft::handle_t &handle, double *input, + double *trans_input, double *components, + double *explained_var, double *explained_var_ratio, + double *singular_vals, const paramsTSVD &prms); } // namespace ML diff --git a/cpp/include/cuml/decomposition/tsvd_mg.hpp b/cpp/include/cuml/decomposition/tsvd_mg.hpp index 16573dba39..5c1b4d01b6 100644 --- a/cpp/include/cuml/decomposition/tsvd_mg.hpp +++ b/cpp/include/cuml/decomposition/tsvd_mg.hpp @@ -37,12 +37,12 @@ namespace opg { * @param[in] prms: data structure that includes all the parameters from input size to algorithm * @param[in] verbose */ -void fit(cumlHandle &handle, MLCommon::Matrix::RankSizePair **rank_sizes, +void fit(raft::handle_t &handle, MLCommon::Matrix::RankSizePair **rank_sizes, size_t n_parts, MLCommon::Matrix::floatData_t **input, float *components, float *singular_vals, paramsTSVD prms, bool verbose = false); -void fit(cumlHandle &handle, MLCommon::Matrix::RankSizePair **rank_sizes, +void fit(raft::handle_t &handle, MLCommon::Matrix::RankSizePair **rank_sizes, size_t n_parts, MLCommon::Matrix::doubleData_t **input, double *components, double *singular_vals, paramsTSVD prms, bool verbose = false); @@ 
-61,7 +61,7 @@ void fit(cumlHandle &handle, MLCommon::Matrix::RankSizePair **rank_sizes, * @param[in] prms: data structure that includes all the parameters from input size to algorithm * @param[in] verbose */ -void fit_transform(cumlHandle &handle, +void fit_transform(raft::handle_t &handle, std::vector *> &input_data, MLCommon::Matrix::PartDescriptor &input_desc, std::vector *> &trans_data, @@ -70,7 +70,7 @@ void fit_transform(cumlHandle &handle, float *explained_var_ratio, float *singular_vals, paramsTSVD prms, bool verbose); -void fit_transform(cumlHandle &handle, +void fit_transform(raft::handle_t &handle, std::vector *> &input_data, MLCommon::Matrix::PartDescriptor &input_desc, std::vector *> &trans_data, @@ -90,15 +90,17 @@ void fit_transform(cumlHandle &handle, * @param[in] prms: data structure that includes all the parameters from input size to algorithm * @param[in] verbose */ -void transform(cumlHandle &handle, MLCommon::Matrix::RankSizePair **rank_sizes, - size_t n_parts, MLCommon::Matrix::Data **input, - float *components, MLCommon::Matrix::Data **trans_input, - paramsTSVD prms, bool verbose); +void transform(raft::handle_t &handle, + MLCommon::Matrix::RankSizePair **rank_sizes, size_t n_parts, + MLCommon::Matrix::Data **input, float *components, + MLCommon::Matrix::Data **trans_input, paramsTSVD prms, + bool verbose); -void transform(cumlHandle &handle, MLCommon::Matrix::RankSizePair **rank_sizes, - size_t n_parts, MLCommon::Matrix::Data **input, - double *components, MLCommon::Matrix::Data **trans_input, - paramsTSVD prms, bool verbose); +void transform(raft::handle_t &handle, + MLCommon::Matrix::RankSizePair **rank_sizes, size_t n_parts, + MLCommon::Matrix::Data **input, double *components, + MLCommon::Matrix::Data **trans_input, paramsTSVD prms, + bool verbose); /** * @brief performs MNMG inverse transform operation for the output. 
@@ -111,14 +113,14 @@ void transform(cumlHandle &handle, MLCommon::Matrix::RankSizePair **rank_sizes, * @param[in] prms: data structure that includes all the parameters from input size to algorithm * @param[in] verbose */ -void inverse_transform(cumlHandle &handle, +void inverse_transform(raft::handle_t &handle, MLCommon::Matrix::RankSizePair **rank_sizes, size_t n_parts, MLCommon::Matrix::Data **trans_input, float *components, MLCommon::Matrix::Data **input, paramsTSVD prms, bool verbose); -void inverse_transform(cumlHandle &handle, +void inverse_transform(raft::handle_t &handle, MLCommon::Matrix::RankSizePair **rank_sizes, size_t n_parts, MLCommon::Matrix::Data **trans_input,
diff --git a/cpp/include/cuml/distance/distance_type.h b/cpp/include/cuml/distance/distance_type.h new file mode 100644 index 0000000000..da881c35b9 --- /dev/null +++ b/cpp/include/cuml/distance/distance_type.h @@ -0,0 +1,23 @@ +#pragma once + +namespace ML { +namespace Distance { + +/** enum to tell how to compute Euclidean distance */ +enum DistanceType : unsigned short { + /** evaluate as dist_ij = sum(x_ik^2) + sum(y_jk^2) - 2*sum(x_ik * y_jk) */ + EucExpandedL2 = 0, + /** same as above, but inside the epilogue, perform square root operation */ + EucExpandedL2Sqrt = 1, + /** cosine distance */ + EucExpandedCosine = 2, + /** L1 distance */ + EucUnexpandedL1 = 3, + /** evaluate as dist_ij += (x_ik - y_jk)^2 */ + EucUnexpandedL2 = 4, + /** same as above, but inside the epilogue, perform square root operation */ + EucUnexpandedL2Sqrt = 5, +}; + +}; // end namespace Distance +}; // end namespace ML
diff --git a/cpp/include/cuml/ensemble/randomforest.hpp b/cpp/include/cuml/ensemble/randomforest.hpp index 545c5decad..cda6e88125 100644 --- a/cpp/include/cuml/ensemble/randomforest.hpp +++ b/cpp/include/cuml/ensemble/randomforest.hpp @@ -57,9 +57,14 @@ struct RF_params { */ int n_trees; /** - * Control bootstrapping. If set, each tree in the forest is built on a - * bootstrapped sample with replacement. - * If false, sampling without replacement is done. + * Control bootstrapping. + * If bootstrapping is set to true, bootstrapped samples are used for building + * each tree. Bootstrapped sampling is done by randomly drawing + * round(rows_sample * n_samples) samples with replacement. More on + * bootstrapping: + * https://en.wikipedia.org/wiki/Bootstrap_aggregating + * If bootstrapping is set to false, the whole dataset is used to build each + * tree.
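The clarified doc above pins the sampling rule down: with `bootstrap == true`, each tree draws `round(rows_sample * n_samples)` row indices uniformly with replacement; with `bootstrap == false`, every tree trains on the full dataset. That sampling step in isolation, as a host-side sketch (the library itself performs this differently, on the GPU):

```cpp
#include <cmath>
#include <cstdint>
#include <random>
#include <vector>

// Draw per-tree row indices the way the doc above describes:
// round(rows_sample * n_samples) draws, uniform, with replacement.
std::vector<int> bootstrap_rows(int n_samples, float rows_sample,
                                std::uint64_t seed) {
  int n_draws = static_cast<int>(std::lround(rows_sample * n_samples));
  std::mt19937_64 rng(seed);
  std::uniform_int_distribution<int> pick(0, n_samples - 1);

  std::vector<int> rows(n_draws);
  for (int& r : rows) r = pick(rng);  // with replacement: repeats allowed
  return rows;
}
```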
*/ bool bootstrap; /** @@ -129,6 +134,9 @@ void print_rf_summary(const RandomForestMetaData* forest); template void print_rf_detailed(const RandomForestMetaData* forest); +template +std::string dump_rf_as_json(const RandomForestMetaData* forest); + template void build_treelite_forest(ModelHandle* model, const RandomForestMetaData* forest, @@ -143,37 +151,37 @@ void compare_concat_forest_to_subforests( typedef RandomForestMetaData RandomForestClassifierF; typedef RandomForestMetaData RandomForestClassifierD; -void fit(const cumlHandle& user_handle, RandomForestClassifierF*& forest, +void fit(const raft::handle_t& user_handle, RandomForestClassifierF*& forest, float* input, int n_rows, int n_cols, int* labels, int n_unique_labels, RF_params rf_params, int verbosity = CUML_LEVEL_INFO); -void fit(const cumlHandle& user_handle, RandomForestClassifierD*& forest, +void fit(const raft::handle_t& user_handle, RandomForestClassifierD*& forest, double* input, int n_rows, int n_cols, int* labels, int n_unique_labels, RF_params rf_params, int verbosity = CUML_LEVEL_INFO); -void predict(const cumlHandle& user_handle, +void predict(const raft::handle_t& user_handle, const RandomForestClassifierF* forest, const float* input, int n_rows, int n_cols, int* predictions, int verbosity = CUML_LEVEL_INFO); -void predict(const cumlHandle& user_handle, +void predict(const raft::handle_t& user_handle, const RandomForestClassifierD* forest, const double* input, int n_rows, int n_cols, int* predictions, int verbosity = CUML_LEVEL_INFO); -void predictGetAll(const cumlHandle& user_handle, +void predictGetAll(const raft::handle_t& user_handle, const RandomForestClassifierF* forest, const float* input, int n_rows, int n_cols, int* predictions, int verbosity = CUML_LEVEL_INFO); -void predictGetAll(const cumlHandle& user_handle, +void predictGetAll(const raft::handle_t& user_handle, const RandomForestClassifierD* forest, const double* input, int n_rows, int n_cols, int* predictions, int verbosity = CUML_LEVEL_INFO); -RF_metrics score(const cumlHandle& user_handle, +RF_metrics score(const raft::handle_t& user_handle, const RandomForestClassifierF* forest, const int* ref_labels, int n_rows, const int* predictions, int verbosity = CUML_LEVEL_INFO); -RF_metrics score(const cumlHandle& user_handle, +RF_metrics score(const raft::handle_t& user_handle, const RandomForestClassifierD* forest, const int* ref_labels, int n_rows, const int* predictions, int verbosity = CUML_LEVEL_INFO); @@ -183,34 +191,35 @@ RF_params set_rf_class_obj(int max_depth, int max_leaves, float max_features, float min_impurity_decrease, bool bootstrap_features, bool bootstrap, int n_trees, float rows_sample, int seed, CRITERION split_criterion, - bool quantile_per_tree, int cfg_n_streams); + bool quantile_per_tree, int cfg_n_streams, + bool use_experimental_backend, int max_batch_size); // ----------------------------- Regression ----------------------------------- // typedef RandomForestMetaData RandomForestRegressorF; typedef RandomForestMetaData RandomForestRegressorD; -void fit(const cumlHandle& user_handle, RandomForestRegressorF*& forest, +void fit(const raft::handle_t& user_handle, RandomForestRegressorF*& forest, float* input, int n_rows, int n_cols, float* labels, RF_params rf_params, int verbosity = CUML_LEVEL_INFO); -void fit(const cumlHandle& user_handle, RandomForestRegressorD*& forest, +void fit(const raft::handle_t& user_handle, RandomForestRegressorD*& forest, double* input, int n_rows, int n_cols, double* labels, RF_params rf_params, int 
verbosity = CUML_LEVEL_INFO); -void predict(const cumlHandle& user_handle, +void predict(const raft::handle_t& user_handle, const RandomForestRegressorF* forest, const float* input, int n_rows, int n_cols, float* predictions, int verbosity = CUML_LEVEL_INFO); -void predict(const cumlHandle& user_handle, +void predict(const raft::handle_t& user_handle, const RandomForestRegressorD* forest, const double* input, int n_rows, int n_cols, double* predictions, int verbosity = CUML_LEVEL_INFO); -RF_metrics score(const cumlHandle& user_handle, +RF_metrics score(const raft::handle_t& user_handle, const RandomForestRegressorF* forest, const float* ref_labels, int n_rows, const float* predictions, int verbosity = CUML_LEVEL_INFO); -RF_metrics score(const cumlHandle& user_handle, +RF_metrics score(const raft::handle_t& user_handle, const RandomForestRegressorD* forest, const double* ref_labels, int n_rows, const double* predictions, int verbosity = CUML_LEVEL_INFO); diff --git a/cpp/include/cuml/fil/fil.h b/cpp/include/cuml/fil/fil.h index 7642375255..b2f8b924be 100644 --- a/cpp/include/cuml/fil/fil.h +++ b/cpp/include/cuml/fil/fil.h @@ -79,8 +79,14 @@ enum storage_type_t { AUTO, /** import the forest as dense */ DENSE, - /** import the forest as sparse */ - SPARSE + /** import the forest as sparse (currently always with 16-byte nodes) */ + SPARSE, + /** (experimental) import the forest as sparse with 8-byte nodes; can fail if + 8-byte nodes are not enough to store the forest, e.g. there are too many + nodes in a tree or too many features; note that the number of bits used to + store the child or feature index can change in the future; this can affect + whether a particular forest can be imported as SPARSE8 */ + SPARSE8, }; /** val_t is the payload within a FIL leaf */ @@ -98,57 +104,93 @@ struct dense_node_t { int bits; }; -/** sparse_node_extra_data is what's missing from a dense node to store +/** sparse_node16_extra_data is what's missing from a dense node to store a sparse node, that is, extra indexing information due to compressing a sparse tree. */ -struct sparse_node_extra_data { +struct sparse_node16_extra_data { int left_idx; int dummy; // make alignment explicit and reserve for future use }; -/** sparse_node_t is a node in a sparsely-stored forest */ -struct sparse_node_t : dense_node_t, sparse_node_extra_data { - sparse_node_t() = default; - sparse_node_t(dense_node_t dn, sparse_node_extra_data ed) - : dense_node_t(dn), sparse_node_extra_data(ed) {} +/** sparse_node16_t is a 16-byte node in a sparsely-stored forest */ +struct sparse_node16_t : dense_node_t, sparse_node16_extra_data { + sparse_node16_t() = default; + sparse_node16_t(dense_node_t dn, sparse_node16_extra_data ed) + : dense_node_t(dn), sparse_node16_extra_data(ed) {} }; -/** leaf_value_t describes what the leaves in a FIL forest store (predict) */ -enum leaf_value_t { - /** storing a class probability or regression summand */ - FLOAT_SCALAR = 0, - /** storing a class label */ - INT_CLASS_LABEL = 1 +/** sparse_node8_t is a node of reduced size (8 bytes) + in a sparsely-stored forest */ +struct sparse_node8_t : dense_node_t { + sparse_node8_t() = default; + sparse_node8_t(dense_node_t dn) : dense_node_t(dn) {} +}; + +/** leaf_algo_t describes what the leaves in a FIL forest store (predict) + and how FIL aggregates them into class margins/regression result/best class +**/ +enum leaf_algo_t { + /** storing a class probability or regression summand. 
We add all margins + together and determine the regression result or use a threshold to determine + one of the two classes. **/ + FLOAT_UNARY_BINARY = 0, + /** storing a class label. Trees vote on the resulting class. + Probabilities are just normalized votes. */ + CATEGORICAL_LEAF = 1, + /** 1-vs-rest, or tree-per-class, where trees are assigned round-robin to + consecutive categories and predict a floating-point margin. Used in + Gradient Boosted Decision Trees. We sum margins for each group separately. + **/ + GROVE_PER_CLASS = 2, + /** 1-vs-rest, or tree-per-class, where trees are assigned round-robin to + consecutive categories and predict a floating-point margin. Used in + Gradient Boosted Decision Trees. We sum margins for each group separately. + This is a more specific version of GROVE_PER_CLASS. + _FEW_CLASSES means fewer (or as many) classes than threads. **/ + GROVE_PER_CLASS_FEW_CLASSES = 3, + /** 1-vs-rest, or tree-per-class, where trees are assigned round-robin to + consecutive categories and predict a floating-point margin. Used in + Gradient Boosted Decision Trees. We sum margins for each group separately. + This is a more specific version of GROVE_PER_CLASS. + _MANY_CLASSES means more classes than threads. **/ + GROVE_PER_CLASS_MANY_CLASSES = 4, // to be extended }; -template +template struct leaf_output_t {}; template <> -struct leaf_output_t { +struct leaf_output_t { typedef float T; }; template <> -struct leaf_output_t { +struct leaf_output_t { typedef int T; }; +template <> +struct leaf_output_t { + typedef float T; +}; +template <> +struct leaf_output_t { + typedef float T; +}; -/** dense_node_init initializes node from paramters */ -void dense_node_init(dense_node_t* n, val_t output, float thresh, int fid, - bool def_left, bool is_leaf); - -/** dense_node_decode extracts individual members from node */ -void dense_node_decode(const dense_node_t* node, val_t* output, float* thresh, - int* fid, bool* def_left, bool* is_leaf); - -/** sparse_node_init initializes node from parameters */ -void sparse_node_init(sparse_node_t* node, val_t output, float thresh, int fid, - bool def_left, bool is_leaf, int left_index); +/** node_init initializes node from parameters */ +void node_init(dense_node_t* n, val_t output, float thresh, int fid, + bool def_left, bool is_leaf); +void node_init(sparse_node16_t* node, val_t output, float thresh, int fid, + bool def_left, bool is_leaf, int left_index); +void node_init(sparse_node8_t* node, val_t output, float thresh, int fid, + bool def_left, bool is_leaf, int left_index); -/** sparse_node_decode extracts individual members from node */ -void sparse_node_decode(const sparse_node_t* node, val_t* output, float* thresh, - int* fid, bool* def_left, bool* is_leaf, - int* left_index); +/** node_decode extracts individual members from node */ +void node_decode(const dense_node_t* node, val_t* output, float* thresh, + int* fid, bool* def_left, bool* is_leaf); +void node_decode(const sparse_node16_t* node, val_t* output, float* thresh, + int* fid, bool* def_left, bool* is_leaf, int* left_index); +void node_decode(const sparse_node8_t* node, val_t* output, float* thresh, + int* fid, bool* def_left, bool* is_leaf, int* left_index); struct forest; @@ -165,20 +207,20 @@ struct forest_params_t { int num_trees; // num_cols is the number of columns in the data int num_cols; - // leaf_payload_type determines what the leaves store (predict) - leaf_value_t leaf_payload_type; + // leaf_algo determines what the leaves store (predict) + leaf_algo_t leaf_algo; // algo
is the inference algorithm; // sparse forests do not distinguish between NAIVE and TREE_REORG algo_t algo; // output is the desired output type output_t output; - // threshold is used to for classification if leaf_payload_type == FLOAT_SCALAR && (output & OUTPUT_CLASS) != 0 && !predict_proba, + // threshold is used for classification if leaf_algo == FLOAT_UNARY_BINARY && (output & OUTPUT_CLASS) != 0 && !predict_proba, // and is ignored otherwise float threshold; // global_bias is added to the sum of tree predictions // (after averaging, if it is used, but before any further transformations) float global_bias; - // only used for INT_CLASS_LABEL inference. since we're storing the + // only used for CATEGORICAL_LEAF inference. since we're storing the // labels in leaves instead of the whole vector, this keeps track // of the number of classes int num_classes; @@ -207,18 +249,30 @@ struct treelite_params_t { (2**(params->depth + 1) - 1) * params->ntrees * @param params pointer to parameters used to initialize the forest */ -void init_dense(const cumlHandle& h, forest_t* pf, const dense_node_t* nodes, - const forest_params_t* params); +void init_dense(const raft::handle_t& h, forest_t* pf, + const dense_node_t* nodes, const forest_params_t* params); + +/** init_sparse uses params, trees and nodes to initialize the sparse forest + * with 16-byte nodes stored in pf + * @param h cuML handle used by this function + * @param pf pointer to where to store the newly created forest + * @param trees indices of tree roots in the nodes array, of length params->ntrees + * @param nodes nodes for the forest, of length params->num_nodes + * @param params pointer to parameters used to initialize the forest + */ +void init_sparse(const raft::handle_t& h, forest_t* pf, const int* trees, + const sparse_node16_t* nodes, const forest_params_t* params); -/** init_sparse uses params, trees and nodes to initialize the sparse forest stored in pf +/** init_sparse uses params, trees and nodes to initialize the sparse forest + * with 8-byte nodes stored in pf * @param h cuML handle used by this function * @param pf pointer to where to store the newly created forest * @param trees indices of tree roots in the nodes array, of length params->ntrees * @param nodes nodes for the forest, of length params->num_nodes * @param params pointer to parameters used to initialize the forest */ -void init_sparse(const cumlHandle& h, forest_t* pf, const int* trees, - const sparse_node_t* nodes, const forest_params_t* params); +void init_sparse(const raft::handle_t& h, forest_t* pf, const int* trees, + const sparse_node8_t* nodes, const forest_params_t* params); /** from_treelite uses a treelite model to initialize the forest * @param handle cuML handle used by this function @@ -226,14 +280,14 @@ void init_sparse(const cumlHandle& h, forest_t* pf, const int* trees, * @param model treelite model used to initialize the forest * @param tl_params additional parameters for the forest */ -void from_treelite(const cumlHandle& handle, forest_t* pforest, +void from_treelite(const raft::handle_t& handle, forest_t* pforest, ModelHandle model, const treelite_params_t* tl_params); /** free deletes forest and all resources held by it; after this, forest is no longer usable * @param h cuML handle used by this function * @param f the forest to free; not usable after the call to this function */ -void free(const cumlHandle& h, forest_t f); +void free(const raft::handle_t& h, forest_t f); /** predict predicts on data (with n rows) using forest and
writes results into preds; * the number of columns is stored in forest, and both preds and data point to GPU memory @@ -247,8 +301,8 @@ void free(const cumlHandle& h, forest_t f); * @param predict_proba for classifier models, this forces to output both class probabilities * instead of binary class prediction. format matches scikit-learn API */ -void predict(const cumlHandle& h, forest_t f, float* preds, const float* data, - size_t num_rows, bool predict_proba = false); +void predict(const raft::handle_t& h, forest_t f, float* preds, + const float* data, size_t num_rows, bool predict_proba = false); } // namespace fil } // namespace ML diff --git a/cpp/include/cuml/linear_model/glm.hpp b/cpp/include/cuml/linear_model/glm.hpp index 90f08e40f8..6eff9f4e28 100644 --- a/cpp/include/cuml/linear_model/glm.hpp +++ b/cpp/include/cuml/linear_model/glm.hpp @@ -33,10 +33,10 @@ namespace GLM { * @param algo specifies which solver to use (0: SVD, 1: Eigendecomposition, 2: QR-decomposition) * @{ */ -void olsFit(const cumlHandle &handle, float *input, int n_rows, int n_cols, +void olsFit(const raft::handle_t &handle, float *input, int n_rows, int n_cols, float *labels, float *coef, float *intercept, bool fit_intercept, bool normalize, int algo = 0); -void olsFit(const cumlHandle &handle, double *input, int n_rows, int n_cols, +void olsFit(const raft::handle_t &handle, double *input, int n_rows, int n_cols, double *labels, double *coef, double *intercept, bool fit_intercept, bool normalize, int algo = 0); /** @} */ @@ -56,14 +56,14 @@ void olsFit(const cumlHandle &handle, double *input, int n_rows, int n_cols, * @param algo specifies which solver to use (0: SVD, 1: Eigendecomposition) * @{ */ -void ridgeFit(const cumlHandle &handle, float *input, int n_rows, int n_cols, - float *labels, float *alpha, int n_alpha, float *coef, +void ridgeFit(const raft::handle_t &handle, float *input, int n_rows, + int n_cols, float *labels, float *alpha, int n_alpha, float *coef, float *intercept, bool fit_intercept, bool normalize, int algo = 0); -void ridgeFit(const cumlHandle &handle, double *input, int n_rows, int n_cols, - double *labels, double *alpha, int n_alpha, double *coef, - double *intercept, bool fit_intercept, bool normalize, - int algo = 0); +void ridgeFit(const raft::handle_t &handle, double *input, int n_rows, + int n_cols, double *labels, double *alpha, int n_alpha, + double *coef, double *intercept, bool fit_intercept, + bool normalize, int algo = 0); /** @} */ /** @@ -76,21 +76,21 @@ void ridgeFit(const cumlHandle &handle, double *input, int n_rows, int n_cols, * @param preds device pointer to store predictions of size n_rows * @{ */ -void olsPredict(const cumlHandle &handle, const float *input, int n_rows, +void olsPredict(const raft::handle_t &handle, const float *input, int n_rows, int n_cols, const float *coef, float intercept, float *preds); -void olsPredict(const cumlHandle &handle, const double *input, int n_rows, +void olsPredict(const raft::handle_t &handle, const double *input, int n_rows, int n_cols, const double *coef, double intercept, double *preds); -void ridgePredict(const cumlHandle &handle, const float *input, int n_rows, +void ridgePredict(const raft::handle_t &handle, const float *input, int n_rows, int n_cols, const float *coef, float intercept, float *preds); -void ridgePredict(const cumlHandle &handle, const double *input, int n_rows, +void ridgePredict(const raft::handle_t &handle, const double *input, int n_rows, int n_cols, const double *coef, double intercept, double *preds); 
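As a usage sketch for the declarations above (assuming d_X, d_y, d_coef, and d_preds are device buffers the caller has already allocated and filled; the raft handle header path is an assumption):

```cpp
#include <cuml/linear_model/glm.hpp>
#include <raft/handle.hpp>  // assumed location of raft::handle_t

// Illustrative only: fit OLS on device data, then predict with the
// learned coefficients. Error checking is omitted.
void ols_roundtrip(float* d_X, float* d_y, float* d_coef, float* d_preds,
                   int n_rows, int n_cols) {
  raft::handle_t handle;
  float intercept = 0.0f;
  ML::GLM::olsFit(handle, d_X, n_rows, n_cols, d_y, d_coef, &intercept,
                  /*fit_intercept=*/true, /*normalize=*/false, /*algo=*/0);
  ML::GLM::olsPredict(handle, d_X, n_rows, n_cols, d_coef, intercept,
                      d_preds);
}
```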
/** @} */ /** * @defgroup qnFit to fit a GLM using quasi newton methods. - * @param cuml_handle reference to cumlHandle object + * @param cuml_handle reference to raft::handle_t object * @param X device pointer to feature matrix of dimension * NxD (row- or column major: see X_col_major param) * @param y device pointer to label vector of length N (for @@ -125,13 +125,13 @@ void ridgePredict(const cumlHandle &handle, const double *input, int n_rows, * normal/squared, 2: multinomial/softmax) * @{ */ -void qnFit(const cumlHandle &cuml_handle, float *X, float *y, int N, int D, +void qnFit(const raft::handle_t &cuml_handle, float *X, float *y, int N, int D, int C, bool fit_intercept, float l1, float l2, int max_iter, float grad_tol, int linesearch_max_iter, int lbfgs_memory, int verbosity, float *w0, float *f, int *num_iters, bool X_col_major, int loss_type); -void qnFit(const cumlHandle &cuml_handle, double *X, double *y, int N, int D, - int C, bool fit_intercept, double l1, double l2, int max_iter, +void qnFit(const raft::handle_t &cuml_handle, double *X, double *y, int N, + int D, int C, bool fit_intercept, double l1, double l2, int max_iter, double grad_tol, int linesearch_max_iter, int lbfgs_memory, int verbosity, double *w0, double *f, int *num_iters, bool X_col_major, int loss_type); @@ -139,7 +139,7 @@ void qnFit(const cumlHandle &cuml_handle, double *X, double *y, int N, int D, /** * @defgroup qnDecisionFunction to obtain the confidence scores of samples - * @param cuml_handle reference to cumlHandle object + * @param cuml_handle reference to raft::handle_t object * @param X device pointer to feature matrix of dimension NxD (row- or column major: see X_col_major param) * @param N number of examples * @param D number of features @@ -151,17 +151,17 @@ void qnFit(const cumlHandle &cuml_handle, double *X, double *y, int N, int D, * @param scores device pointer to confidence scores of length N (for binary logistic: [0,1], for multinomial: [0,...,C-1]) * @{ */ -void qnDecisionFunction(const cumlHandle &cuml_handle, float *X, int N, int D, - int C, bool fit_intercept, float *params, +void qnDecisionFunction(const raft::handle_t &cuml_handle, float *X, int N, + int D, int C, bool fit_intercept, float *params, bool X_col_major, int loss_type, float *scores); -void qnDecisionFunction(const cumlHandle &cuml_handle, double *X, int N, int D, - int C, bool fit_intercept, double *params, +void qnDecisionFunction(const raft::handle_t &cuml_handle, double *X, int N, + int D, int C, bool fit_intercept, double *params, bool X_col_major, int loss_type, double *scores); /** @} */ /** * @defgroup qnPredict to predict with a GLM fit using quasi newton methods.
- * @param cuml_handle reference to cumlHandle object + * @param cuml_handle reference to raft::handle_t object * @param X device pointer to feature matrix of dimension NxD (row- or column major: see X_col_major param) * @param N number of examples * @param D number of features @@ -173,11 +173,11 @@ void qnDecisionFunction(const cumlHandle &cuml_handle, double *X, int N, int D, * @param preds device pointer to predictions of length N (for binary logistic: [0,1], for multinomial: [0,...,C-1]) * @{ */ -void qnPredict(const cumlHandle &cuml_handle, float *X, int N, int D, int C, +void qnPredict(const raft::handle_t &cuml_handle, float *X, int N, int D, int C, bool fit_intercept, float *params, bool X_col_major, int loss_type, float *preds); -void qnPredict(const cumlHandle &cuml_handle, double *X, int N, int D, int C, - bool fit_intercept, double *params, bool X_col_major, +void qnPredict(const raft::handle_t &cuml_handle, double *X, int N, int D, + int C, bool fit_intercept, double *params, bool X_col_major, int loss_type, double *preds); /** @} */ diff --git a/cpp/include/cuml/linear_model/ols_mg.hpp b/cpp/include/cuml/linear_model/ols_mg.hpp index 5308acdca7..37dea89df8 100644 --- a/cpp/include/cuml/linear_model/ols_mg.hpp +++ b/cpp/include/cuml/linear_model/ols_mg.hpp @@ -39,14 +39,14 @@ namespace opg { * @param[in] algo: which algorithm is used for OLS. 0 is for SVD, 1 is for eig. * @param[in] verbose */ -void fit(cumlHandle &handle, +void fit(raft::handle_t &handle, std::vector *> &input_data, MLCommon::Matrix::PartDescriptor &input_desc, std::vector *> &labels, float *coef, float *intercept, bool fit_intercept, bool normalize, int algo, bool verbose); -void fit(cumlHandle &handle, +void fit(raft::handle_t &handle, std::vector *> &input_data, MLCommon::Matrix::PartDescriptor &input_desc, std::vector *> &labels, double *coef, @@ -66,14 +66,16 @@ void fit(cumlHandle &handle, * @param[out] preds: predictions * @param[in] verbose */ -void predict(cumlHandle &handle, MLCommon::Matrix::RankSizePair **rank_sizes, - size_t n_parts, MLCommon::Matrix::Data **input, - size_t n_rows, size_t n_cols, float *coef, float intercept, +void predict(raft::handle_t &handle, + MLCommon::Matrix::RankSizePair **rank_sizes, size_t n_parts, + MLCommon::Matrix::Data **input, size_t n_rows, + size_t n_cols, float *coef, float intercept, MLCommon::Matrix::Data **preds, bool verbose); -void predict(cumlHandle &handle, MLCommon::Matrix::RankSizePair **rank_sizes, - size_t n_parts, MLCommon::Matrix::Data **input, - size_t n_rows, size_t n_cols, double *coef, double intercept, +void predict(raft::handle_t &handle, + MLCommon::Matrix::RankSizePair **rank_sizes, size_t n_parts, + MLCommon::Matrix::Data **input, size_t n_rows, + size_t n_cols, double *coef, double intercept, MLCommon::Matrix::Data **preds, bool verbose); }; // end namespace opg diff --git a/cpp/include/cuml/linear_model/preprocess_mg.hpp b/cpp/include/cuml/linear_model/preprocess_mg.hpp index 46e79b5b48..a204648b14 100644 --- a/cpp/include/cuml/linear_model/preprocess_mg.hpp +++ b/cpp/include/cuml/linear_model/preprocess_mg.hpp @@ -17,16 +17,16 @@ #pragma once #include -#include #include #include #include +#include namespace ML { namespace GLM { namespace opg { -void preProcessData(cumlHandle &handle, +void preProcessData(raft::handle_t &handle, std::vector *> &input_data, MLCommon::Matrix::PartDescriptor &input_desc, std::vector *> &labels, @@ -34,7 +34,7 @@ void preProcessData(cumlHandle &handle, bool fit_intercept, bool normalize, cudaStream_t 
*streams, int n_streams, bool verbose); -void preProcessData(cumlHandle &handle, +void preProcessData(raft::handle_t &handle, std::vector *> &input_data, MLCommon::Matrix::PartDescriptor &input_desc, std::vector *> &labels, @@ -42,7 +42,7 @@ void preProcessData(cumlHandle &handle, bool fit_intercept, bool normalize, cudaStream_t *streams, int n_streams, bool verbose); -void postProcessData(cumlHandle &handle, +void postProcessData(raft::handle_t &handle, std::vector *> &input_data, MLCommon::Matrix::PartDescriptor &input_desc, std::vector *> &labels, @@ -51,7 +51,7 @@ void postProcessData(cumlHandle &handle, bool normalize, cudaStream_t *streams, int n_streams, bool verbose); -void postProcessData(cumlHandle &handle, +void postProcessData(raft::handle_t &handle, std::vector *> &input_data, MLCommon::Matrix::PartDescriptor &input_desc, std::vector *> &labels, diff --git a/cpp/include/cuml/linear_model/ridge_mg.hpp b/cpp/include/cuml/linear_model/ridge_mg.hpp index bc8ca50e09..b5cb23a47e 100644 --- a/cpp/include/cuml/linear_model/ridge_mg.hpp +++ b/cpp/include/cuml/linear_model/ridge_mg.hpp @@ -41,14 +41,14 @@ namespace opg { * @param[in] algo: the algorithm to use for fitting * @param[in] verbose */ -void fit(cumlHandle &handle, +void fit(raft::handle_t &handle, std::vector *> &input_data, MLCommon::Matrix::PartDescriptor &input_desc, std::vector *> &labels, float *alpha, int n_alpha, float *coef, float *intercept, bool fit_intercept, bool normalize, int algo, bool verbose); -void fit(cumlHandle &handle, +void fit(raft::handle_t &handle, std::vector *> &input_data, MLCommon::Matrix::PartDescriptor &input_desc, std::vector *> &labels, double *alpha, @@ -68,14 +68,16 @@ void fit(cumlHandle &handle, * @param[out] preds: predictions * @param[in] verbose */ -void predict(cumlHandle &handle, MLCommon::Matrix::RankSizePair **rank_sizes, - size_t n_parts, MLCommon::Matrix::Data **input, - size_t n_rows, size_t n_cols, float *coef, float intercept, +void predict(raft::handle_t &handle, + MLCommon::Matrix::RankSizePair **rank_sizes, size_t n_parts, + MLCommon::Matrix::Data **input, size_t n_rows, + size_t n_cols, float *coef, float intercept, MLCommon::Matrix::Data **preds, bool verbose); -void predict(cumlHandle &handle, MLCommon::Matrix::RankSizePair **rank_sizes, - size_t n_parts, MLCommon::Matrix::Data **input, - size_t n_rows, size_t n_cols, double *coef, double intercept, +void predict(raft::handle_t &handle, + MLCommon::Matrix::RankSizePair **rank_sizes, size_t n_parts, + MLCommon::Matrix::Data **input, size_t n_rows, + size_t n_cols, double *coef, double intercept, MLCommon::Matrix::Data **preds, bool verbose); }; // end namespace opg diff --git a/cpp/include/cuml/manifold/tsne.h b/cpp/include/cuml/manifold/tsne.h index 14ba176b07..e94d1dd4d7 100644 --- a/cpp/include/cuml/manifold/tsne.h +++ b/cpp/include/cuml/manifold/tsne.h @@ -59,7 +59,7 @@ namespace ML { * or >= 0 for reproducible outputs. * @param[in] verbosity verbosity level for logging messages during * execution - * @param[in] intialize_embeddings Whether to overwrite the current Y vector + * @param[in] initialize_embeddings Whether to overwrite the current Y vector * with random noise. * @param[in] barnes_hut Whether to use the fast Barnes Hut or use the * slower exact version. @@ -71,10 +71,11 @@ namespace ML { * approach is available in their article t-SNE-CUDA: GPU-Accelerated t-SNE and * its Applications to Modern Data (https://arxiv.org/abs/1807.11824). 
*/ -void TSNE_fit(const cumlHandle &handle, const float *X, float *Y, const int n, - const int p, const int dim = 2, int n_neighbors = 1023, - const float theta = 0.5f, const float epssq = 0.0025, - float perplexity = 50.0f, const int perplexity_max_iter = 100, +void TSNE_fit(const raft::handle_t &handle, const float *X, float *Y, + const int n, const int p, const int dim = 2, + int n_neighbors = 1023, const float theta = 0.5f, + const float epssq = 0.0025, float perplexity = 50.0f, + const int perplexity_max_iter = 100, const float perplexity_tol = 1e-5, const float early_exaggeration = 12.0f, const int exaggeration_iter = 250, const float min_gain = 0.01f, @@ -84,6 +85,6 @@ void TSNE_fit(const cumlHandle &handle, const float *X, float *Y, const int n, const float pre_momentum = 0.5, const float post_momentum = 0.8, const long long random_state = -1, int verbosity = CUML_LEVEL_INFO, - const bool intialize_embeddings = true, bool barnes_hut = true); + const bool initialize_embeddings = true, bool barnes_hut = true); } // namespace ML diff --git a/cpp/include/cuml/manifold/umap.hpp b/cpp/include/cuml/manifold/umap.hpp index af6f9d9966..d90464c9f9 100644 --- a/cpp/include/cuml/manifold/umap.hpp +++ b/cpp/include/cuml/manifold/umap.hpp @@ -22,20 +22,20 @@ namespace ML { -void transform(const cumlHandle &handle, float *X, int n, int d, +void transform(const raft::handle_t &handle, float *X, int n, int d, int64_t *knn_indices, float *knn_dists, float *orig_X, int orig_n, float *embedding, int embedding_n, UMAPParams *params, float *transformed); -void find_ab(const cumlHandle &handle, UMAPParams *params); +void find_ab(const raft::handle_t &handle, UMAPParams *params); -void fit(const cumlHandle &handle, +void fit(const raft::handle_t &handle, float *X, // input matrix float *y, // labels int n, int d, int64_t *knn_indices, float *knn_dists, UMAPParams *params, float *embeddings); -void fit(const cumlHandle &handle, +void fit(const raft::handle_t &handle, float *X, // input matrix int n, // rows int d, // cols @@ -45,11 +45,11 @@ void fit(const cumlHandle &handle, class UMAP_API { float *orig_X; int orig_n; - cumlHandle *handle; + raft::handle_t *handle; UMAPParams *params; public: - UMAP_API(const cumlHandle &handle, UMAPParams *params); + UMAP_API(const raft::handle_t &handle, UMAPParams *params); ~UMAP_API(); /** diff --git a/cpp/include/cuml/metrics/metrics.hpp b/cpp/include/cuml/metrics/metrics.hpp index e54e72f5bf..4b7fdd1070 100644 --- a/cpp/include/cuml/metrics/metrics.hpp +++ b/cpp/include/cuml/metrics/metrics.hpp @@ -16,6 +16,7 @@ #pragma once +#include #include namespace ML { @@ -32,13 +33,13 @@ namespace Metrics { * in a linear regression model. The larger the R-squared value, the * more variability is explained by the linear regression model. * -* @param handle: cumlHandle +* @param handle: raft::handle_t * @param y: Array of ground-truth response variables * @param y_hat: Array of predicted response variables * @param n: Number of elements in y and y_hat * @return: The R-squared value. */ -float r2_score_py(const cumlHandle &handle, float *y, float *y_hat, int n); +float r2_score_py(const raft::handle_t &handle, float *y, float *y_hat, int n); /** * Calculates the "Coefficient of Determination" (R-Squared) score @@ -50,27 +51,28 @@ float r2_score_py(const cumlHandle &handle, float *y, float *y_hat, int n); * in a linear regression model. The larger the R-squared value, the * more variability is explained by the linear regression model. 
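For reference, the quantity both r2_score_py overloads report can be written down in a few lines of host code (illustrative only; the actual computation runs on device):

```cpp
#include <cstddef>
#include <vector>

// R^2 = 1 - SS_res / SS_tot, as described in the docs above.
double r2_reference(const std::vector<double>& y,
                    const std::vector<double>& y_hat) {
  double mean = 0.0;
  for (double v : y) mean += v;
  mean /= static_cast<double>(y.size());
  double ss_res = 0.0, ss_tot = 0.0;
  for (std::size_t i = 0; i < y.size(); ++i) {
    ss_res += (y[i] - y_hat[i]) * (y[i] - y_hat[i]);  // residual sum of squares
    ss_tot += (y[i] - mean) * (y[i] - mean);          // total sum of squares
  }
  return 1.0 - ss_res / ss_tot;
}
```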
* -* @param handle: cumlHandle +* @param handle: raft::handle_t * @param y: Array of ground-truth response variables * @param y_hat: Array of predicted response variables * @param n: Number of elements in y and y_hat * @return: The R-squared value. */ -double r2_score_py(const cumlHandle &handle, double *y, double *y_hat, int n); +double r2_score_py(const raft::handle_t &handle, double *y, double *y_hat, + int n); /** * Calculates the "rand index" * * This metric is a measure of similarity between two data clusterings. * -* @param handle: cumlHandle +* @param handle: raft::handle_t * @param y: Array of response variables of the first clustering classifications * @param y_hat: Array of response variables of the second clustering classifications * @param n: Number of elements in y and y_hat * @return: The rand index value */ -double randIndex(const cumlHandle &handle, double *y, double *y_hat, int n); +double randIndex(const raft::handle_t &handle, double *y, double *y_hat, int n); /** * Calculates the "Silhouette Score" @@ -81,7 +83,7 @@ double randIndex(const cumlHandle &handle, double *y, double *y_hat, int n); * and the nearest cluster that the sample is not a part of. Note that Silhouette Coefficient * is only defined if number of labels is 2 <= n_labels <= n_samples - 1. * -* @param handle: cumlHandle +* @param handle: raft::handle_t * @param y: Array of data samples with dimensions (nRows x nCols) * @param nRows: number of data samples * @param nCols: number of features @@ -90,7 +92,7 @@ double silhouetteScore(const cumlHandle &handle, double *y, int nRows, * @param metric: the numerical value that maps to the type of distance metric to be used in the calculations * @param silScores: Array that is optionally taken in as input if required to be populated with the silhouette score for every sample (1 x nRows), else nullptr is passed */ -double silhouetteScore(const cumlHandle &handle, double *y, int nRows, +double silhouetteScore(const raft::handle_t &handle, double *y, int nRows, int nCols, int *labels, int nLabels, double *silScores, int metric); /** @@ -98,16 +100,16 @@ double silhouetteScore(const cumlHandle &handle, double *y, int nRows, * * This metric is the corrected-for-chance version of the rand index * -* @param handle: cumlHandle +* @param handle: raft::handle_t * @param y: Array of response variables of the first clustering classifications * @param y_hat: Array of response variables of the second clustering classifications * @param n: Number of elements in y and y_hat * @return: The adjusted rand index value * @{ */ -double adjustedRandIndex(const cumlHandle &handle, const int64_t *y, +double adjustedRandIndex(const raft::handle_t &handle, const int64_t *y, const int64_t *y_hat, const int64_t n); -double adjustedRandIndex(const cumlHandle &handle, const int *y, +double adjustedRandIndex(const raft::handle_t &handle, const int *y, const int *y_hat, const int n); /** @} */ @@ -118,13 +120,13 @@ double adjustedRandIndex(const cumlHandle &handle, const int *y, * approximates the probability distribution P * It is often also used as a 'distance metric' between two probability distributions (not symmetric) * -* @param handle: cumlHandle +* @param handle: raft::handle_t * @param y: Array of probabilities corresponding to distribution P * @param y_hat: Array of probabilities corresponding to distribution Q * @param n: Number of elements in y and y_hat * @return: The KL Divergence value */ -double klDivergence(const cumlHandle &handle, const double *y, +double klDivergence(const
raft::handle_t &handle, const double *y, const double *y_hat, int n); /** @@ -134,28 +136,28 @@ double klDivergence(const cumlHandle &handle, const double *y, * approximates the probability distribution P * It is often also used as a 'distance metric' between two probability distributions (not symmetric) * -* @param handle: cumlHandle +* @param handle: raft::handle_t * @param y: Array of probabilities corresponding to distribution P * @param y_hat: Array of probabilities corresponding to distribution Q * @param n: Number of elements in y and y_hat * @return: The KL Divergence value */ -float klDivergence(const cumlHandle &handle, const float *y, const float *y_hat, - int n); +float klDivergence(const raft::handle_t &handle, const float *y, + const float *y_hat, int n); /** * Calculates the "entropy" of a labelling * * This metric is a measure of the purity/polarity of the clustering * -* @param handle: cumlHandle +* @param handle: raft::handle_t * @param y: Array of response variables of the clustering * @param n: Number of elements in y * @param lower_class_range: the lowest value in the range of classes * @param upper_class_range: the highest value in the range of classes * @return: The entropy value of the clustering */ -double entropy(const cumlHandle &handle, const int *y, const int n, +double entropy(const raft::handle_t &handle, const int *y, const int n, const int lower_class_range, const int upper_class_range); /** @@ -164,7 +166,7 @@ double entropy(const cumlHandle &handle, const int *y, const int n, * Mutual Information is a measure of the similarity between two labels of * the same data. * -* @param handle: cumlHandle +* @param handle: raft::handle_t * @param y: Array of response variables of the first clustering classifications * @param y_hat: Array of response variables of the second clustering classifications * @param n: Number of elements in y and y_hat @@ -172,8 +174,9 @@ double entropy(const cumlHandle &handle, const int *y, const int n, * @param upper_class_range: the highest value in the range of classes * @return: The mutual information score */ -double mutualInfoScore(const cumlHandle &handle, const int *y, const int *y_hat, - const int n, const int lower_class_range, +double mutualInfoScore(const raft::handle_t &handle, const int *y, + const int *y_hat, const int n, + const int lower_class_range, const int upper_class_range); /** @@ -182,7 +185,7 @@ double mutualInfoScore(const cumlHandle &handle, const int *y, const int *y_hat, * A clustering result satisfies homogeneity if all of its clusters * contain only data points which are members of a single class. * -* @param handle: cumlHandle +* @param handle: raft::handle_t * @param y: truth labels * @param y_hat: predicted labels * @param n: Number of elements in y and y_hat @@ -190,7 +193,7 @@ double mutualInfoScore(const cumlHandle &handle, const int *y, const int *y_hat, * @param upper_class_range: the highest value in the range of classes * @return: The homogeneity score */ -double homogeneityScore(const cumlHandle &handle, const int *y, +double homogeneityScore(const raft::handle_t &handle, const int *y, const int *y_hat, const int n, const int lower_class_range, const int upper_class_range); @@ -201,7 +204,7 @@ double homogeneityScore(const cumlHandle &handle, const int *y, * A clustering result satisfies completeness if all the data points * that are members of a given class are elements of the same cluster.
* -* @param handle: cumlHandle +* @param handle: raft::handle_t * @param y: truth labels * @param y_hat: predicted labels * @param n: Number of elements in y and y_hat @@ -209,7 +212,7 @@ double homogeneityScore(const cumlHandle &handle, const int *y, * @param upper_class_range: the highest value in the range of classes * @return: The completeness score */ -double completenessScore(const cumlHandle &handle, const int *y, +double completenessScore(const raft::handle_t &handle, const int *y, const int *y_hat, const int n, const int lower_class_range, const int upper_class_range); @@ -220,7 +223,7 @@ double completenessScore(const cumlHandle &handle, const int *y, * v-measure is the harmonic mean between the homogeneity * and completeness scores of 2 cluster classifications * -* @param handle: cumlHandle +* @param handle: raft::handle_t * @param y: truth labels * @param y_hat: predicted labels * @param n: Number of elements in y and y_hat @@ -228,7 +231,7 @@ double completenessScore(const cumlHandle &handle, const int *y, * @param upper_class_range: the highest value in the range of classes * @return: The v-measure */ -double vMeasure(const cumlHandle &handle, const int *y, const int *y_hat, +double vMeasure(const raft::handle_t &handle, const int *y, const int *y_hat, const int n, const int lower_class_range, const int upper_class_range); @@ -238,13 +241,55 @@ double vMeasure(const cumlHandle &handle, const int *y, const int *y_hat, * The accuracy metric is used to calculate the accuracy of the predicted labels * -* @param handle: cumlHandle +* @param handle: raft::handle_t * @param predictions: predicted labels * @param ref_predictions: truth labels * @param n: Number of elements in y and y_hat * @return: The accuracy */ -float accuracy_score_py(const cumlHandle &handle, const int *predictions, +float accuracy_score_py(const raft::handle_t &handle, const int *predictions, const int *ref_predictions, int n); + +/** + * @brief Calculates the ij pairwise distances between two input arrays of + * double type + * + * @param handle raft::handle_t + * @param x pointer to the input data samples array (mRows x kCols) + * @param y pointer to the second input data samples array. Can use the same + * pointer as x (nRows x kCols) + * @param dist output pointer where the results will be stored (mRows x nCols) + * @param m number of rows in x + * @param n number of rows in y + * @param k number of cols in x and y (must be the same) + * @param metric the distance metric to use for the calculation + * @param isRowMajor specifies whether the x and y data pointers are row (C + * type array) or col (F type array) major + */ +void pairwiseDistance(const raft::handle_t &handle, const double *x, + const double *y, double *dist, int m, int n, int k, + ML::Distance::DistanceType metric, + bool isRowMajor = true); + +/** + * @brief Calculates the ij pairwise distances between two input arrays of float type + * + * @param handle raft::handle_t + * @param x pointer to the input data samples array (mRows x kCols) + * @param y pointer to the second input data samples array.
Can use the same + * pointer as x (nRows x kCols) + * @param dist output pointer where the results will be stored (mRows x nCols) + * @param m number of rows in x + * @param n number of rows in y + * @param k number of cols in x and y (must be the same) + * @param metric the distance metric to use for the calculation + * @param isRowMajor specifies whether the x and y data pointers are row (C + * type array) or col (F type array) major + */ +void pairwiseDistance(const raft::handle_t &handle, const float *x, + const float *y, float *dist, int m, int n, int k, + ML::Distance::DistanceType metric, + bool isRowMajor = true); + } // namespace Metrics } // namespace ML diff --git a/cpp/include/cuml/neighbors/knn.hpp b/cpp/include/cuml/neighbors/knn.hpp index 198fdfd091..ff52791b8b 100644 --- a/cpp/include/cuml/neighbors/knn.hpp +++ b/cpp/include/cuml/neighbors/knn.hpp @@ -41,25 +41,25 @@ enum MetricType { * a series of input arrays and combine the results into a single * output array for indexes and distances. * - * @param handle the cuml handle to use - * @param input vector of pointers to the input arrays - * @param sizes vector of sizes of input arrays - * @param D the dimensionality of the arrays - * @param search_items array of items to search of dimensionality D - * @param n number of rows in search_items - * @param res_I the resulting index array of size n * k - * @param res_D the resulting distance array of size n * k - * @param k the number of nearest neighbors to return - * @param rowMajorIndex are the index arrays in row-major order? - * @param rowMajorQuery are the query arrays in row-major order? - * @param metric distance metric to use. Euclidean (L2) is used by + * @param[in] handle the cuml handle to use + * @param[in] input vector of pointers to the input arrays + * @param[in] sizes vector of sizes of input arrays + * @param[in] D the dimensionality of the arrays + * @param[in] search_items array of items to search of dimensionality D + * @param[in] n number of rows in search_items + * @param[out] res_I the resulting index array of size n * k + * @param[out] res_D the resulting distance array of size n * k + * @param[in] k the number of nearest neighbors to return + * @param[in] rowMajorIndex are the index arrays in row-major order? + * @param[in] rowMajorQuery are the query arrays in row-major order? + * @param[in] metric distance metric to use. Euclidean (L2) is used by * default - * @param metric_arg the value of `p` for Minkowski (l-p) distances. This + * @param[in] metric_arg the value of `p` for Minkowski (l-p) distances. This * is ignored if the metric_type is not Minkowski. - * @param expanded should lp-based distances be returned in their expanded + * @param[in] expanded should lp-based distances be returned in their expanded * form (e.g., without raising to the 1/p power). */ -void brute_force_knn(cumlHandle &handle, std::vector &input, +void brute_force_knn(raft::handle_t &handle, std::vector &input, std::vector &sizes, int D, float *search_items, int n, int64_t *res_I, float *res_D, int k, bool rowMajorIndex = false, bool rowMajorQuery = false, @@ -72,17 +72,17 @@ void brute_force_knn(cumlHandle &handle, std::vector &input, * by classifying on multiple label arrays. Note that each label is * classified independently, as is done in scikit-learn. 
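Conceptually, for each output array knn_classify reduces to a majority vote over each query row's k neighbor labels; a host-side sketch of that vote (illustrative only, names hypothetical; the real function runs on device and repeats this independently for every label array):

```cpp
#include <cstdint>
#include <map>
#include <vector>

// Illustrative only: majority label among one query row's k neighbors.
int majority_label(const std::vector<std::int64_t>& neighbor_ids,
                   const std::vector<int>& labels) {
  std::map<int, int> votes;
  for (std::int64_t id : neighbor_ids) ++votes[labels[id]];
  int best_label = 0, best_count = -1;
  for (const auto& kv : votes) {
    if (kv.second > best_count) {
      best_label = kv.first;
      best_count = kv.second;
    }
  }
  return best_label;
}
```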
* - * @param handle the cuml handle to use - * @param out output array on device (size n_samples * size of y vector) - * @param knn_indices index array on device resulting from knn query (size n_samples * k) - * @param y vector of label arrays on device vector size is number of (size n_samples) - * @param n_index_rows number of vertices in index (eg. size of each y array) - * @param n_samples number of samples in knn_indices - * @param k number of nearest neighbors in knn_indices + * @param[in] handle the cuml handle to use + * @param[out] out output array on device (size n_samples * size of y vector) + * @param[in] knn_indices index array on device resulting from knn query (size n_samples * k) + * @param[in] y vector of label arrays on device vector size is number of (size n_samples) + * @param[in] n_index_rows number of vertices in index (eg. size of each y array) + * @param[in] n_query_rows number of samples in knn_indices + * @param[in] k number of nearest neighbors in knn_indices */ -void knn_classify(cumlHandle &handle, int *out, int64_t *knn_indices, - std::vector &y, size_t n_index_rows, size_t n_samples, - int k); +void knn_classify(raft::handle_t &handle, int *out, int64_t *knn_indices, + std::vector &y, size_t n_index_rows, + size_t n_query_rows, int k); /** * @brief Flat C++ API function to perform a knn regression using @@ -90,33 +90,33 @@ void knn_classify(cumlHandle &handle, int *out, int64_t *knn_indices, * regression by classifying on multiple label arrays. Note that * each label is classified independently, as is done in scikit-learn. * - * @param handle the cuml handle to use - * @param out output array on device (size n_samples) - * @param knn_indices array on device of knn indices (size n_samples * k) - * @param y array of labels on device (size n_samples) - * @param n_query_rows number of vertices in index (eg. size of each y array) - * @param n_samples number of samples in knn_indices and out - * @param k number of nearest neighbors in knn_indices + * @param[in] handle the cuml handle to use + * @param[out] out output array on device (size n_samples) + * @param[in] knn_indices array on device of knn indices (size n_samples * k) + * @param[in] y array of labels on device (size n_samples) + * @param[in] n_index_rows number of vertices in index (eg. size of each y array) + * @param[in] n_query_rows number of samples in knn_indices and out + * @param[in] k number of nearest neighbors in knn_indices */ -void knn_regress(cumlHandle &handle, float *out, int64_t *knn_indices, - std::vector &y, size_t n_query_rows, size_t n_samples, - int k); +void knn_regress(raft::handle_t &handle, float *out, int64_t *knn_indices, + std::vector &y, size_t n_index_rows, + size_t n_query_rows, int k); /** * @brief Flat C++ API function to compute knn class probabilities * using a vector of device arrays containing discrete class labels. * Note that the output is a vector, which is * - * @param handle the cuml handle to use - * @param out vector of output arrays on device. vector size = n_outputs. + * @param[in] handle the cuml handle to use + * @param[out] out vector of output arrays on device. vector size = n_outputs.
* Each array should have size(n_samples, n_classes) - * @param knn_indices array on device of knn indices (size n_samples * k) - * @param y array of labels on device (size n_samples) - * @param n_index_rows number of labels - * @param n_samples number of samples in knn_indices and out - * @param k number of nearest neighbors in knn_indices + * @param[in] knn_indices array on device of knn indices (size n_samples * k) + * @param[in] y array of labels on device (size n_samples) + * @param[in] n_index_rows number of labels in y + * @param[in] n_query_rows number of rows in knn_indices and out + * @param[in] k number of nearest neighbors in knn_indices */ -void knn_class_proba(cumlHandle &handle, std::vector &out, +void knn_class_proba(raft::handle_t &handle, std::vector &out, int64_t *knn_indices, std::vector &y, - size_t n_index_rows, size_t n_samples, int k); + size_t n_index_rows, size_t n_query_rows, int k); }; // namespace ML diff --git a/cpp/include/cuml/neighbors/knn_api.h b/cpp/include/cuml/neighbors/knn_api.h index 0b49fd4e12..a1ec1b20f7 100644 --- a/cpp/include/cuml/neighbors/knn_api.h +++ b/cpp/include/cuml/neighbors/knn_api.h @@ -27,24 +27,24 @@ extern "C" { * a series of input arrays and combine the results into a single * output array for indexes and distances. * - * @param handle the cuml handle to use - * @param input an array of pointers to the input arrays - * @param size an array of sizes of input arrays - * @param n_params array size of input and sizes - * @param D the dimensionality of the arrays - * @param search_items array of items to search of dimensionality D - * @param n number of rows in search_items - * @param res_I the resulting index array of size n * k - * @param res_D the resulting distance array of size n * k - * @param k the number of nearest neighbors to return - * @param rowMajorIndex is the index array in row major layout? - * @param rowMajorQuery is the query array in row major layout? - * @param metric_type the type of distance metric to use. This corresponds + * @param[in] handle the cuml handle to use + * @param[in] input an array of pointers to the input arrays + * @param[in] size an array of sizes of input arrays + * @param[in] n_params array size of input and sizes + * @param[in] D the dimensionality of the arrays + * @param[in] search_items array of items to search of dimensionality D + * @param[in] n number of rows in search_items + * @param[out] res_I the resulting index array of size n * k + * @param[out] res_D the resulting distance array of size n * k + * @param[in] k the number of nearest neighbors to return + * @param[in] rowMajorIndex is the index array in row major layout? + * @param[in] rowMajorQuery is the query array in row major layout? + * @param[in] metric_type the type of distance metric to use. This corresponds * to the value in the ML::MetricType enum. Default is * Euclidean (L2). - * @param metric_arg the value of `p` for Minkowski (l-p) distances. This + * @param[in] metric_arg the value of `p` for Minkowski (l-p) distances. This * is ignored if the metric_type is not Minkowski. - * @param expanded should lp-based distances be returned in their expanded + * @param[in] expanded should lp-based distances be returned in their expanded * form (e.g., without raising to the 1/p power). 
*/ cumlError_t knn_search(const cumlHandle_t handle, float **input, int *size, diff --git a/cpp/include/cuml/neighbors/knn_mg.hpp b/cpp/include/cuml/neighbors/knn_mg.hpp index 33fda79046..ddc6ad2108 100644 --- a/cpp/include/cuml/neighbors/knn_mg.hpp +++ b/cpp/include/cuml/neighbors/knn_mg.hpp @@ -31,7 +31,7 @@ namespace opg { /** * @brief Performs a multi-node multi-GPU brute force nearest neighbors. - * @param handle: the cumlHandle to use for managing resources + * @param handle: the raft::handle_t to use for managing resources * @param[out] out_I: vector of output index partitions. size should match the * number of local input partitions. * @param[out] out_D: vector of output distance partitions. size should match @@ -49,7 +49,7 @@ namespace opg { * @param[in] verbose: print extra logging info * */ -void brute_force_knn(ML::cumlHandle &handle, +void brute_force_knn(raft::handle_t &handle, std::vector *> &out_I, std::vector &out_D, std::vector &idx_data, @@ -62,31 +62,33 @@ void brute_force_knn(ML::cumlHandle &handle, /** * Performs a multi-node multi-GPU KNN classify. - * @param handle the cumlHandle to use for managing resources - * @param out vector of output labels partitions. size should match the + * @param[in] handle the raft::handle_t to use for managing resources + * @param[out] out vector of output labels partitions. size should match the * number of local input partitions. - * @param out_I vector of output index partitions. size should match the + * @param[out] out_I vector of output index partitions. size should match the * number of local input partitions. - * @param out_D vector of output distance partitions. size should match + * @param[out] out_D vector of output distance partitions. size should match * the number of local input partitions. - * @param probas (optional) pointer to a vector containing arrays of probabilities - * @param idx_data vector of local indices to query - * @param idx_desc describes how the index partitions are distributed + * @param[in] probas (optional) pointer to a vector containing arrays of probabilities + * @param[in] idx_data vector of local indices to query + * @param[in] idx_desc describes how the index partitions are distributed * across the ranks. - * @param query_data vector of local query partitions - * @param query_desc describes how the query partitions are distributed + * @param[in] query_data vector of local query partitions + * @param[in] query_desc describes how the query partitions are distributed * across the cluster. - * @param y vector of vector of label arrays. for multilabel classification, each + * @param[in] y vector of vector of label arrays. for multilabel classification, each * element in the vector is a different "output" array of labels corresponding * to the i'th output. size should match the number of local input partitions. - * @param uniq_labels vector of the sorted unique labels for each array in y - * @param n_unique vector of sizes for each array in uniq_labels - * @param probas_only return probas instead of performing complete knn_classify - * @param k the number of neighbors to query - * @param batch_size the max number of rows to broadcast at a time - * @param verbose print extra logging info + * @param[in] uniq_labels vector of the sorted unique labels for each array in y + * @param[in] n_unique vector of sizes for each array in uniq_labels + * @param[in] rowMajorIndex boolean indicating whether the index is row major. + * @param[in] rowMajorQuery boolean indicating whether the query is row major. 
+ * @param[in] probas_only return probas instead of performing complete knn_classify + * @param[in] k the number of neighbors to query + * @param[in] batch_size the max number of rows to broadcast at a time + * @param[in] verbose print extra logging info */ -void knn_classify(ML::cumlHandle &handle, std::vector *> *out, +void knn_classify(raft::handle_t &handle, std::vector *> *out, std::vector *> *out_I, std::vector *out_D, std::vector> *probas, @@ -102,28 +104,30 @@ void knn_classify(ML::cumlHandle &handle, std::vector *> *out, /** * Performs a multi-node multi-GPU KNN regress. - * @param handle the cumlHandle to use for managing resources - * @param out vector of output partitions. size should match the + * @param[in] handle the raft::handle_t to use for managing resources + * @param[out] out vector of output partitions. size should match the * number of local input partitions. - * @param out_I vector of output index partitions. size should match the + * @param[out] out_I vector of output index partitions. size should match the * number of local input partitions. - * @param out_D vector of output distance partitions. size should match + * @param[out] out_D vector of output distance partitions. size should match * the number of local input partitions. - * @param idx_data vector of local indices to query - * @param idx_desc describes how the index partitions are distributed + * @param[in] idx_data vector of local indices to query + * @param[in] idx_desc describes how the index partitions are distributed * across the ranks. - * @param query_data vector of local query partitions - * @param query_desc describes how the query partitions are distributed + * @param[in] query_data vector of local query partitions + * @param[in] query_desc describes how the query partitions are distributed * across the cluster. - * @param y vector of vector of output arrays. for multi-output regression, each + * @param[in] y vector of vector of output arrays. for multi-output regression, each * element in the vector is a different "output" array corresponding * to the i'th output. size should match the number of local input partitions. - * @param k the number of neighbors to query - * @param n_outputs number of outputs - * @param batch_size the max number of rows to broadcast at a time - * @param verbose print extra logging info + * @param[in] rowMajorIndex boolean indicating whether the index is row major. + * @param[in] rowMajorQuery boolean indicating whether the query is row major. 
+ * @param[in] k the number of neighbors to query + * @param[in] n_outputs number of outputs + * @param[in] batch_size the max number of rows to broadcast at a time + * @param[in] verbose print extra logging info */ -void knn_regress(ML::cumlHandle &handle, +void knn_regress(raft::handle_t &handle, std::vector *> *out, std::vector *> *out_I, std::vector *out_D, diff --git a/cpp/include/cuml/random_projection/rproj_c.h b/cpp/include/cuml/random_projection/rproj_c.h index 1ea3ff4dd0..ddeb682918 100644 --- a/cpp/include/cuml/random_projection/rproj_c.h +++ b/cpp/include/cuml/random_projection/rproj_c.h @@ -82,11 +82,11 @@ struct rand_mat { }; template -void RPROJfit(const cumlHandle &handle, rand_mat *random_matrix, +void RPROJfit(const raft::handle_t &handle, rand_mat *random_matrix, paramsRPROJ *params); template -void RPROJtransform(const cumlHandle &handle, math_t *input, +void RPROJtransform(const raft::handle_t &handle, math_t *input, rand_mat *random_matrix, math_t *output, paramsRPROJ *params); diff --git a/cpp/include/cuml/solvers/cd_mg.hpp b/cpp/include/cuml/solvers/cd_mg.hpp index 2a7fa26974..8ed64c924d 100644 --- a/cpp/include/cuml/solvers/cd_mg.hpp +++ b/cpp/include/cuml/solvers/cd_mg.hpp @@ -43,14 +43,14 @@ namespace opg { * @param[in] tol: tolerance for early stopping during fitting * @param[in] verbose */ -void fit(cumlHandle &handle, +void fit(raft::handle_t &handle, std::vector *> &input_data, MLCommon::Matrix::PartDescriptor &input_desc, std::vector *> &labels, float *coef, float *intercept, bool fit_intercept, bool normalize, int epochs, float alpha, float l1_ratio, bool shuffle, float tol, bool verbose); -void fit(cumlHandle &handle, +void fit(raft::handle_t &handle, std::vector *> &input_data, MLCommon::Matrix::PartDescriptor &input_desc, std::vector *> &labels, double *coef, @@ -70,14 +70,16 @@ void fit(cumlHandle &handle, * @param[out] preds: predictions * @param[in] verbose */ -void predict(cumlHandle &handle, MLCommon::Matrix::RankSizePair **rank_sizes, - size_t n_parts, MLCommon::Matrix::Data **input, - size_t n_rows, size_t n_cols, float *coef, float intercept, +void predict(raft::handle_t &handle, + MLCommon::Matrix::RankSizePair **rank_sizes, size_t n_parts, + MLCommon::Matrix::Data **input, size_t n_rows, + size_t n_cols, float *coef, float intercept, MLCommon::Matrix::Data **preds, bool verbose); -void predict(cumlHandle &handle, MLCommon::Matrix::RankSizePair **rank_sizes, - size_t n_parts, MLCommon::Matrix::Data **input, - size_t n_rows, size_t n_cols, double *coef, double intercept, +void predict(raft::handle_t &handle, + MLCommon::Matrix::RankSizePair **rank_sizes, size_t n_parts, + MLCommon::Matrix::Data **input, size_t n_rows, + size_t n_cols, double *coef, double intercept, MLCommon::Matrix::Data **preds, bool verbose); }; // end namespace opg diff --git a/cpp/include/cuml/solvers/solver.hpp b/cpp/include/cuml/solvers/solver.hpp index df4ea5de8c..2db32a4a03 100644 --- a/cpp/include/cuml/solvers/solver.hpp +++ b/cpp/include/cuml/solvers/solver.hpp @@ -21,47 +21,51 @@ namespace ML { namespace Solver { -void sgdFit(cumlHandle &handle, float *input, int n_rows, int n_cols, +void sgdFit(raft::handle_t &handle, float *input, int n_rows, int n_cols, float *labels, float *coef, float *intercept, bool fit_intercept, int batch_size, int epochs, int lr_type, float eta0, float power_t, int loss, int penalty, float alpha, float l1_ratio, bool shuffle, float tol, int n_iter_no_change); -void sgdFit(cumlHandle &handle, double *input, int n_rows, int n_cols, +void 
sgdFit(raft::handle_t &handle, double *input, int n_rows, int n_cols, double *labels, double *coef, double *intercept, bool fit_intercept, int batch_size, int epochs, int lr_type, double eta0, double power_t, int loss, int penalty, double alpha, double l1_ratio, bool shuffle, double tol, int n_iter_no_change); -void sgdPredict(cumlHandle &handle, const float *input, int n_rows, int n_cols, - const float *coef, float intercept, float *preds, int loss); +void sgdPredict(raft::handle_t &handle, const float *input, int n_rows, + int n_cols, const float *coef, float intercept, float *preds, + int loss); -void sgdPredict(cumlHandle &handle, const double *input, int n_rows, int n_cols, - const double *coef, double intercept, double *preds, int loss); +void sgdPredict(raft::handle_t &handle, const double *input, int n_rows, + int n_cols, const double *coef, double intercept, double *preds, + int loss); -void sgdPredictBinaryClass(cumlHandle &handle, const float *input, int n_rows, - int n_cols, const float *coef, float intercept, - float *preds, int loss); +void sgdPredictBinaryClass(raft::handle_t &handle, const float *input, + int n_rows, int n_cols, const float *coef, + float intercept, float *preds, int loss); -void sgdPredictBinaryClass(cumlHandle &handle, const double *input, int n_rows, - int n_cols, const double *coef, double intercept, - double *preds, int loss); +void sgdPredictBinaryClass(raft::handle_t &handle, const double *input, + int n_rows, int n_cols, const double *coef, + double intercept, double *preds, int loss); -void cdFit(cumlHandle &handle, float *input, int n_rows, int n_cols, +void cdFit(raft::handle_t &handle, float *input, int n_rows, int n_cols, float *labels, float *coef, float *intercept, bool fit_intercept, bool normalize, int epochs, int loss, float alpha, float l1_ratio, bool shuffle, float tol); -void cdFit(cumlHandle &handle, double *input, int n_rows, int n_cols, +void cdFit(raft::handle_t &handle, double *input, int n_rows, int n_cols, double *labels, double *coef, double *intercept, bool fit_intercept, bool normalize, int epochs, int loss, double alpha, double l1_ratio, bool shuffle, double tol); -void cdPredict(cumlHandle &handle, const float *input, int n_rows, int n_cols, - const float *coef, float intercept, float *preds, int loss); +void cdPredict(raft::handle_t &handle, const float *input, int n_rows, + int n_cols, const float *coef, float intercept, float *preds, + int loss); -void cdPredict(cumlHandle &handle, const double *input, int n_rows, int n_cols, - const double *coef, double intercept, double *preds, int loss); +void cdPredict(raft::handle_t &handle, const double *input, int n_rows, + int n_cols, const double *coef, double intercept, double *preds, + int loss); }; // namespace Solver }; // end namespace ML diff --git a/cpp/include/cuml/svm/svc.hpp b/cpp/include/cuml/svm/svc.hpp index 02b291a865..7b559356f9 100644 --- a/cpp/include/cuml/svm/svc.hpp +++ b/cpp/include/cuml/svm/svc.hpp @@ -44,12 +44,13 @@ namespace SVM { * @param [in] param parameters for training * @param [in] kernel_params parameters for the kernel function * @param [out] model parameters of the trained model + * @param [in] sample_weight optional sample weights, size [n_rows] */ template -void svcFit(const cumlHandle &handle, math_t *input, int n_rows, int n_cols, +void svcFit(const raft::handle_t &handle, math_t *input, int n_rows, int n_cols, math_t *labels, const svmParameter ¶m, MLCommon::Matrix::KernelParams &kernel_params, - svmModel &model); + svmModel &model, const 
math_t *sample_weight = nullptr); /** * @brief Predict classes or decision function value for samples in input. @@ -81,8 +82,8 @@ void svcFit(const cumlHandle &handle, math_t *input, int n_rows, int n_cols, * return the decision function value (false) */ template -void svcPredict(const cumlHandle &handle, math_t *input, int n_rows, int n_cols, - MLCommon::Matrix::KernelParams &kernel_params, +void svcPredict(const raft::handle_t &handle, math_t *input, int n_rows, + int n_cols, MLCommon::Matrix::KernelParams &kernel_params, const svmModel &model, math_t *preds, math_t buffer_size, bool predict_class = true); @@ -93,7 +94,7 @@ void svcPredict(const cumlHandle &handle, math_t *input, int n_rows, int n_cols, * @param [inout] m SVM model parameters */ template -void svmFreeBuffers(const cumlHandle &handle, svmModel &m); +void svmFreeBuffers(const raft::handle_t &handle, svmModel &m); /** * @brief C-Support Vector Classification @@ -133,7 +134,7 @@ class SVC { * @param nochange_steps number of steps with no change wrt convergence * @param verbosity verbosity level for logging messages during execution */ - SVC(cumlHandle &handle, math_t C = 1, math_t tol = 1.0e-3, + SVC(raft::handle_t &handle, math_t C = 1, math_t tol = 1.0e-3, MLCommon::Matrix::KernelParams kernel_params = MLCommon::Matrix::KernelParams{MLCommon::Matrix::LINEAR, 3, 1, 0}, math_t cache_size = 200, int max_iter = -1, int nochange_steps = 1000, @@ -151,8 +152,10 @@ class SVC { * @param n_rows number of rows * @param n_cols number of columns * @param labels device pointer for the labels. Size n_rows. + * @param [in] sample_weight optional sample weights, size [n_rows] */ - void fit(math_t *input, int n_rows, int n_cols, math_t *labels); + void fit(math_t *input, int n_rows, int n_cols, math_t *labels, + const math_t *sample_weight = nullptr); /** * @brief Predict classes for samples in input. @@ -177,7 +180,7 @@ class SVC { void decisionFunction(math_t *input, int n_rows, int n_cols, math_t *preds); private: - const cumlHandle &handle; + const raft::handle_t &handle; }; }; // end namespace SVM diff --git a/cpp/include/cuml/svm/svr.hpp b/cpp/include/cuml/svm/svr.hpp index 679f59ae4f..8ca308acac 100644 --- a/cpp/include/cuml/svm/svr.hpp +++ b/cpp/include/cuml/svm/svr.hpp @@ -43,12 +43,13 @@ namespace SVM { * @param [in] param parameters for training * @param [in] kernel_params parameters for the kernel function * @param [out] model parameters of the trained model + * @param [in] sample_weight optional sample weights, size [n_rows] */ template -void svrFit(const cumlHandle &handle, math_t *X, int n_rows, int n_cols, +void svrFit(const raft::handle_t &handle, math_t *X, int n_rows, int n_cols, math_t *y, const svmParameter ¶m, MLCommon::Matrix::KernelParams &kernel_params, - svmModel &model); + svmModel &model, const math_t *sample_weight = nullptr); // For prediction we use svcPredict diff --git a/cpp/include/cuml/tree/decisiontree.hpp b/cpp/include/cuml/tree/decisiontree.hpp index 2236dbba0c..a5b7a91c82 100644 --- a/cpp/include/cuml/tree/decisiontree.hpp +++ b/cpp/include/cuml/tree/decisiontree.hpp @@ -15,7 +15,8 @@ */ #pragma once -#include +#include +#include #include "algo_helper.h" #include "flatnode.h" @@ -61,14 +62,24 @@ struct DecisionTreeParams { * Node split criterion. GINI and Entropy for classification, MSE or MAE for regression. */ CRITERION split_criterion; - /** - * Weahther to fully reshuffle the features for subsampling at each tree node. 
Default is one shuffle per depth with random start point in the shuffled feature list per node - */ - bool shuffle_features; /** * Minimum impurity decrease required for splitting a node. If the impurity decrease is below this value, node is leafed out. Default is 0.0 */ float min_impurity_decrease = 0.0f; + + /** + * Maximum number of nodes that can be processed in a given batch. This is + * used only for the batched-level algorithm + */ + int max_batch_size; + /** + * If set to true and the following conditions are also met, the experimental + * decision tree training implementation will be used: + * split_algo = 1 (GLOBAL_QUANTILE) + * max_features = 1.0 (Feature sub-sampling disabled) + * quantile_per_tree = false (No per tree quantile computation) + */ + bool use_experimental_backend; }; /** @@ -86,7 +97,12 @@ struct DecisionTreeParams { * @param[in] cfg_split_criterion: split criterion; default CRITERION_END, * i.e., GINI for classification or MSE for regression * @param[in] cfg_quantile_per_tree: compute quantile per tree; default false - * @param[in] cfg_shuffle_features: whether to shuffle features or not + * @param[in] cfg_use_experimental_backend: If set to true, the experimental + * batched backend is used (provided the other conditions are met). Default is + false. + * @param[in] cfg_max_batch_size: Maximum number of nodes that can be processed + in a batch. This is used only for the batched-level algorithm. Default + value 128. */ void set_tree_params(DecisionTreeParams &params, int cfg_max_depth = -1, int cfg_max_leaves = -1, float cfg_max_features = 1.0f, @@ -96,7 +112,8 @@ void set_tree_params(DecisionTreeParams &params, int cfg_max_depth = -1, bool cfg_bootstrap_features = false, CRITERION cfg_split_criterion = CRITERION_END, bool cfg_quantile_per_tree = false, - bool cfg_shuffle_features = false); + bool cfg_use_experimental_backend = false, + int cfg_max_batch_size = 128); /** * @brief Check validity of all decision tree hyper-parameters. @@ -138,6 +155,9 @@ void print_tree_summary(const TreeMetaDataNode *tree); template void print_tree(const TreeMetaDataNode *tree); +template +std::string dump_tree_as_json(const TreeMetaDataNode *tree); + // ----------------------------- Classification ----------------------------------- // typedef TreeMetaDataNode TreeClassifierF; @@ -146,7 +166,7 @@ typedef TreeMetaDataNode TreeClassifierD; /** * @defgroup DecisionTreeClassifierFit Fit functions * @brief Build (i.e., fit, train) Decision Tree classifier for input data. - * @param[in] handle: cumlHandle + * @param[in] handle: raft::handle_t * @param[in, out] tree: CPU pointer to TreeMetaDataNode. User allocated. * @param[in] data: train data (nrows samples, ncols features) in column major format, * excluding labels. Device pointer. @@ -166,13 +186,13 @@ typedef TreeMetaDataNode TreeClassifierD; * @param[in] tree_params: Decision Tree training hyper parameter struct.
* @{ */ -void decisionTreeClassifierFit(const ML::cumlHandle &handle, +void decisionTreeClassifierFit(const raft::handle_t &handle, TreeClassifierF *&tree, float *data, const int ncols, const int nrows, int *labels, unsigned int *rowids, const int n_sampled_rows, int unique_labels, DecisionTree::DecisionTreeParams tree_params); -void decisionTreeClassifierFit(const ML::cumlHandle &handle, +void decisionTreeClassifierFit(const raft::handle_t &handle, TreeClassifierD *&tree, double *data, const int ncols, const int nrows, int *labels, unsigned int *rowids, const int n_sampled_rows, @@ -184,7 +204,7 @@ void decisionTreeClassifierFit(const ML::cumlHandle &handle, * @defgroup DecisionTreeClassifierPredict Predict functions * @brief Predict target feature for input data; n-ary classification for * single feature supported. Inference of trees is CPU only for now. - * @param[in] handle: cumlHandle (currently unused; API placeholder) + * @param[in] handle: raft::handle_t (currently unused; API placeholder) * @param[in] tree: CPU pointer to TreeMetaDataNode. * @param[in] rows: test data (n_rows samples, n_cols features) in row major format. * Current impl. expects a CPU pointer. TODO future API change. @@ -198,12 +218,12 @@ void decisionTreeClassifierFit(const ML::cumlHandle &handle, * the caller itself might have set. * @{ */ -void decisionTreeClassifierPredict(const ML::cumlHandle &handle, +void decisionTreeClassifierPredict(const raft::handle_t &handle, const TreeClassifierF *tree, const float *rows, const int n_rows, const int n_cols, int *predictions, int verbosity = -1); -void decisionTreeClassifierPredict(const ML::cumlHandle &handle, +void decisionTreeClassifierPredict(const raft::handle_t &handle, const TreeClassifierD *tree, const double *rows, const int n_rows, const int n_cols, int *predictions, @@ -218,7 +238,7 @@ typedef TreeMetaDataNode TreeRegressorD; /** * @defgroup DecisionTreeRegressorFit Fit functions * @brief Build (i.e., fit, train) Decision Tree regressor for input data. - * @param[in] handle: cumlHandle + * @param[in] handle: raft::handle_t * @param[in, out] tree: CPU pointer to TreeMetaDataNode. User allocated. * @param[in] data: train data (nrows samples, ncols features) in column major format, * excluding labels. Device pointer. @@ -234,12 +254,12 @@ typedef TreeMetaDataNode TreeRegressorD; * @param[in] tree_params: Decision Tree training hyper parameter struct. * @{ */ -void decisionTreeRegressorFit(const ML::cumlHandle &handle, +void decisionTreeRegressorFit(const raft::handle_t &handle, TreeRegressorF *&tree, float *data, const int ncols, const int nrows, float *labels, unsigned int *rowids, const int n_sampled_rows, DecisionTree::DecisionTreeParams tree_params); -void decisionTreeRegressorFit(const ML::cumlHandle &handle, +void decisionTreeRegressorFit(const raft::handle_t &handle, TreeRegressorD *&tree, double *data, const int ncols, const int nrows, double *labels, unsigned int *rowids, const int n_sampled_rows, @@ -250,7 +270,7 @@ void decisionTreeRegressorFit(const ML::cumlHandle &handle, * @defgroup DecisionTreeRegressorPredict Predict functions * @brief Predict target feature for input data; regression for single feature supported. * Inference of trees is CPU only for now. - * @param[in] handle: cumlHandle (currently unused; API placeholder) + * @param[in] handle: raft::handle_t (currently unused; API placeholder) * @param[in] tree: CPU pointer to TreeMetaDataNode. * @param[in] rows: test data (n_rows samples, n_cols features) in row major format. * Current impl. 
expects a CPU pointer. TODO future API change. @@ -264,11 +284,11 @@ void decisionTreeRegressorFit(const ML::cumlHandle &handle, * the caller itself might have set. * @{ */ -void decisionTreeRegressorPredict(const ML::cumlHandle &handle, +void decisionTreeRegressorPredict(const raft::handle_t &handle, const TreeRegressorF *tree, const float *rows, const int n_rows, const int n_cols, float *predictions, int verbosity = -1); -void decisionTreeRegressorPredict(const ML::cumlHandle &handle, +void decisionTreeRegressorPredict(const raft::handle_t &handle, const TreeRegressorD *tree, const double *rows, const int n_rows, const int n_cols, double *predictions, diff --git a/cpp/include/cuml/tree/flatnode.h b/cpp/include/cuml/tree/flatnode.h index 6604c1b411..74eba5e235 100644 --- a/cpp/include/cuml/tree/flatnode.h +++ b/cpp/include/cuml/tree/flatnode.h @@ -1,5 +1,5 @@ /* - * Copyright (c) 2019, NVIDIA CORPORATION. + * Copyright (c) 2019-2020, NVIDIA CORPORATION. * * Licensed under the Apache License, Version 2.0 (the "License"); * you may not use this file except in compliance with the License. @@ -13,16 +13,22 @@ * See the License for the specific language governing permissions and * limitations under the License. */ + #pragma once -/* sparse node same tree node in Decsion Tree. -* This however used an index instead of pointer to left child -* Right child index is left_child_id + 1 -*/ -template <typename T, typename L> + +/** + * A node in a Decision Tree. + * It uses an index instead of a pointer to the left child; the right child + * index is assumed to be `left_child_id + 1` + * @tparam DataT data type + * @tparam LabelT label type + * @tparam IdxT type used for indexing operations + */ +template <typename DataT, typename LabelT, typename IdxT> struct SparseTreeNode { - L prediction; - int colid = -1; - T quesval; - T best_metric_val; - int left_child_id = -1; + LabelT prediction; + IdxT colid = IdxT(-1); + DataT quesval; + DataT best_metric_val; + IdxT left_child_id = IdxT(-1); }; diff --git a/cpp/include/cuml/tsa/arima_common.h b/cpp/include/cuml/tsa/arima_common.h index afcd61c397..e2220727e8 100644 --- a/cpp/include/cuml/tsa/arima_common.h +++ b/cpp/include/cuml/tsa/arima_common.h @@ -20,6 +20,7 @@ #include +#include #include #include #include @@ -39,13 +40,13 @@ struct ARIMAOrder { int s; // Seasonal period int k; // Fit intercept?
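A brief aside before the reworked ARIMAOrder helpers below: the new n_diff/n_phi/n_theta/r/rd methods all derive from the fields above. A minimal, hypothetical worked example (the order p=d=q=P=D=Q=1, s=12, k=1 is chosen for illustration, not taken from the PR) for a seasonal ARIMA(1,1,1)(1,1,1) model with period 12 and an intercept:

```cpp
#include <algorithm>
#include <cstdio>

// Hedged sketch mirroring the ARIMAOrder helper formulas; the order values
// below are illustrative only.
int main() {
  int p = 1, d = 1, q = 1, P = 1, D = 1, Q = 1, s = 12, k = 1;
  int n_diff = d + s * D;                  // 13 observations lost to differencing
  int n_phi = p + s * P;                   // 13 combined AR coefficients
  int n_theta = q + s * Q;                 // 13 combined MA coefficients
  int r = std::max(n_phi, n_theta + 1);    // 14: reduced state-space size
  int rd = n_diff + r;                     // 27: state size with differencing folded in
  int complexity = p + P + q + Q + k + 1;  // 6 free parameters
  std::printf("n_diff=%d r=%d rd=%d complexity=%d\n", n_diff, r, rd, complexity);
  return 0;
}
```

The new rd() helper is what the batched Kalman code later in this diff dispatches on: when differencing is folded into the state space rather than applied as a preprocessing pass, the state has rd rather than r components.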
- inline int r() const { return std::max(p + s * P, q + s * Q + 1); } - inline int complexity() const { return p + P + q + Q + k + 1; } - inline int lost_in_diff() const { return d + s * D; } + inline int n_diff() const { return d + s * D; } inline int n_phi() const { return p + s * P; } inline int n_theta() const { return q + s * Q; } - - inline bool need_prep() const { return static_cast<bool>(d + D); } + inline int r() const { return std::max(n_phi(), n_theta() + 1); } + inline int rd() const { return n_diff() + r(); } + inline int complexity() const { return p + P + q + Q + k + 1; } + inline bool need_diff() const { return static_cast<bool>(d + D); } }; /** diff --git a/cpp/include/cuml/tsa/auto_arima.h b/cpp/include/cuml/tsa/auto_arima.h index de3ef1c218..cd9a2fa357 100644 --- a/cpp/include/cuml/tsa/auto_arima.h +++ b/cpp/include/cuml/tsa/auto_arima.h @@ -30,7 +30,7 @@ namespace ML { * @param[in] batch_size Batch size * @return The number of 'true' series in the mask */ -int divide_by_mask_build_index(const cumlHandle& handle, const bool* d_mask, +int divide_by_mask_build_index(const raft::handle_t& handle, const bool* d_mask, int* d_index, int batch_size); /** @@ -46,15 +46,15 @@ int divide_by_mask_build_index(const cumlHandle& handle, const bool* d_mask, * @param[in] batch_size Batch size * @param[in] n_obs Number of data points per series */ -void divide_by_mask_execute(const cumlHandle& handle, const float* d_in, +void divide_by_mask_execute(const raft::handle_t& handle, const float* d_in, const bool* d_mask, const int* d_index, float* d_out0, float* d_out1, int batch_size, int n_obs); -void divide_by_mask_execute(const cumlHandle& handle, const double* d_in, +void divide_by_mask_execute(const raft::handle_t& handle, const double* d_in, const bool* d_mask, const int* d_index, double* d_out0, double* d_out1, int batch_size, int n_obs); -void divide_by_mask_execute(const cumlHandle& handle, const int* d_in, +void divide_by_mask_execute(const raft::handle_t& handle, const int* d_in, const bool* d_mask, const int* d_index, int* d_out0, int* d_out1, int batch_size, int n_obs); @@ -72,12 +72,14 @@ void divide_by_mask_execute(const cumlHandle& handle, const int* d_in, * @param[in] batch_size Batch size * @param[in] n_sub Number of sub-batches */ -void divide_by_min_build_index(const cumlHandle& handle, const float* d_matrix, - int* d_batch, int* d_index, int* h_size, - int batch_size, int n_sub); -void divide_by_min_build_index(const cumlHandle& handle, const double* d_matrix, - int* d_batch, int* d_index, int* h_size, - int batch_size, int n_sub); +void divide_by_min_build_index(const raft::handle_t& handle, + const float* d_matrix, int* d_batch, + int* d_index, int* h_size, int batch_size, + int n_sub); +void divide_by_min_build_index(const raft::handle_t& handle, + const double* d_matrix, int* d_batch, + int* d_index, int* h_size, int batch_size, + int n_sub); /** * Batch division by minimum value step 2: create all the sub-batches @@ -92,15 +94,15 @@ void divide_by_min_build_index(const cumlHandle& handle, const double* d_matrix, * @param[in] batch_size Batch size * @param[in] n_sub Number of sub-batches * @param[in] n_obs Number of data points per series */ -void divide_by_min_execute(const cumlHandle& handle, const float* d_in, +void divide_by_min_execute(const raft::handle_t& handle, const float* d_in, const int* d_batch, const int* d_index, float** hd_out, int batch_size, int n_sub, int n_obs); -void divide_by_min_execute(const cumlHandle& handle, const double* d_in, +void divide_by_min_execute(const raft::handle_t& handle,
const double* d_in, const int* d_batch, const int* d_index, double** hd_out, int batch_size, int n_sub, int n_obs); -void divide_by_min_execute(const cumlHandle& handle, const int* d_in, +void divide_by_min_execute(const raft::handle_t& handle, const int* d_in, const int* d_batch, const int* d_index, int** hd_out, int batch_size, int n_sub, int n_obs); @@ -119,7 +121,7 @@ void divide_by_min_execute(const cumlHandle& handle, const int* d_in, * @param[in] batch_size Batch size * @param[in] n_sub Number of sub-batches */ -void build_division_map(const cumlHandle& handle, const int* const* hd_id, +void build_division_map(const raft::handle_t& handle, const int* const* hd_id, const int* h_size, int* d_id_to_pos, int* d_id_to_model, int batch_size, int n_sub); @@ -140,10 +142,10 @@ void build_division_map(const cumlHandle& handle, const int* const* hd_id, * @param[in] n_sub Number of sub-batches * @param[in] n_obs Number of observations (or forecasts) per series */ -void merge_series(const cumlHandle& handle, const float* const* hd_in, +void merge_series(const raft::handle_t& handle, const float* const* hd_in, const int* d_id_to_pos, const int* d_id_to_sub, float* d_out, int batch_size, int n_sub, int n_obs); -void merge_series(const cumlHandle& handle, const double* const* hd_in, +void merge_series(const raft::handle_t& handle, const double* const* hd_in, const int* d_id_to_pos, const int* d_id_to_sub, double* d_out, int batch_size, int n_sub, int n_obs); diff --git a/cpp/include/cuml/tsa/batched_arima.hpp b/cpp/include/cuml/tsa/batched_arima.hpp index 5cdd442236..569515b45b 100644 --- a/cpp/include/cuml/tsa/batched_arima.hpp +++ b/cpp/include/cuml/tsa/batched_arima.hpp @@ -33,7 +33,7 @@ enum LoglikeMethod { CSS, MLE }; * @param[in] n_obs Number of observations * @param[in] order ARIMA order */ -void batched_diff(cumlHandle& handle, double* d_y_diff, const double* d_y, +void batched_diff(raft::handle_t& handle, double* d_y_diff, const double* d_y, int batch_size, int n_obs, const ARIMAOrder& order); /** @@ -59,13 +59,18 @@ void batched_diff(cumlHandle& handle, double* d_y_diff, const double* d_y, * number of observations * @param[in] fc_steps Number of steps to forecast * @param[in] d_fc Array to store the forecast + * @param[in] level Confidence level for prediction intervals. 0 to + * skip the computation. Else 0 < level < 1 + * @param[out] d_lower Lower limit of the prediction interval + * @param[out] d_upper Upper limit of the prediction interval */ -void batched_loglike(cumlHandle& handle, const double* d_y, int batch_size, +void batched_loglike(raft::handle_t& handle, const double* d_y, int batch_size, int n_obs, const ARIMAOrder& order, const double* d_params, double* loglike, double* d_vs, bool trans = true, bool host_loglike = true, LoglikeMethod method = MLE, - int truncate = 0, int fc_steps = 0, - double* d_fc = nullptr); + int truncate = 0, int fc_steps = 0, double* d_fc = nullptr, + double level = 0, double* d_lower = nullptr, + double* d_upper = nullptr); /** * Compute the loglikelihood of the given parameter on the given time series @@ -92,13 +97,18 @@ void batched_loglike(cumlHandle& handle, const double* d_y, int batch_size, * number of observations * @param[in] fc_steps Number of steps to forecast * @param[in] d_fc Array to store the forecast + * @param[in] level Confidence level for prediction intervals. 0 to + * skip the computation. 
Else 0 < level < 1 + * @param[out] d_lower Lower limit of the prediction interval + * @param[out] d_upper Upper limit of the prediction interval */ -void batched_loglike(cumlHandle& handle, const double* d_y, int batch_size, +void batched_loglike(raft::handle_t& handle, const double* d_y, int batch_size, int n_obs, const ARIMAOrder& order, const ARIMAParams& params, double* loglike, double* d_vs, bool trans = true, bool host_loglike = true, LoglikeMethod method = MLE, int truncate = 0, - int fc_steps = 0, double* d_fc = nullptr); + int fc_steps = 0, double* d_fc = nullptr, double level = 0, + double* d_lower = nullptr, double* d_upper = nullptr); /** * Compute the gradient of the log-likelihood @@ -117,10 +127,11 @@ void batched_loglike(cumlHandle& handle, const double* d_y, int batch_size, * @param[in] truncate For CSS, start the sum-of-squares after a given * number of observations */ -void batched_loglike_grad(cumlHandle& handle, const double* d_y, int batch_size, - int n_obs, const ARIMAOrder& order, const double* d_x, - double* d_grad, double h, bool trans = true, - LoglikeMethod method = MLE, int truncate = 0); +void batched_loglike_grad(raft::handle_t& handle, const double* d_y, + int batch_size, int n_obs, const ARIMAOrder& order, + const double* d_x, double* d_grad, double h, + bool trans = true, LoglikeMethod method = MLE, + int truncate = 0); /** * Batched in-sample and out-of-sample prediction of a time-series given all @@ -136,12 +147,18 @@ void batched_loglike_grad(cumlHandle& handle, const double* d_y, int batch_size, * @param[in] end Index to end the prediction (excluded) * @param[in] order ARIMA hyper-parameters * @param[in] params ARIMA parameters (device) - * @param[out] d_vs Residual output (device) * @param[out] d_y_p Prediction output (device) + * @param[in] pre_diff Whether to use pre-differencing + * @param[in] level Confidence level for prediction intervals. 0 to + * skip the computation. Else 0 < level < 1 + * @param[out] d_lower Lower limit of the prediction interval + * @param[out] d_upper Upper limit of the prediction interval */ -void predict(cumlHandle& handle, const double* d_y, int batch_size, int n_obs, - int start, int end, const ARIMAOrder& order, - const ARIMAParams& params, double* d_vs, double* d_y_p); +void predict(raft::handle_t& handle, const double* d_y, int batch_size, + int n_obs, int start, int end, const ARIMAOrder& order, + const ARIMAParams& params, double* d_y_p, + bool pre_diff = true, double level = 0, double* d_lower = nullptr, + double* d_upper = nullptr); /** * Compute an information criterion (AIC, AICc, BIC) @@ -159,7 +176,7 @@ void predict(cumlHandle& handle, const double* d_y, int batch_size, int n_obs, * @param[in] ic_type Type of information criterion wanted. 
* 0: AIC, 1: AICc, 2: BIC */ -void information_criterion(cumlHandle& handle, const double* d_y, +void information_criterion(raft::handle_t& handle, const double* d_y, int batch_size, int n_obs, const ARIMAOrder& order, const ARIMAParams& params, double* ic, int ic_type); @@ -176,7 +193,7 @@ void information_criterion(cumlHandle& handle, const double* d_y, * (all series must be identical) * @param[in] order ARIMA hyper-parameters */ -void estimate_x0(cumlHandle& handle, ARIMAParams& params, +void estimate_x0(raft::handle_t& handle, ARIMAParams& params, const double* d_y, int batch_size, int n_obs, const ARIMAOrder& order); diff --git a/cpp/include/cuml/tsa/batched_kalman.hpp b/cpp/include/cuml/tsa/batched_kalman.hpp index f5935b3752..1cec632aa3 100644 --- a/cpp/include/cuml/tsa/batched_kalman.hpp +++ b/cpp/include/cuml/tsa/batched_kalman.hpp @@ -38,12 +38,18 @@ namespace ML { * shape=(nobs-d-s*D, batch_size) (device) * @param[in] fc_steps Number of steps to forecast * @param[in] d_fc Array to store the forecast + * @param[in] level Confidence level for prediction intervals. 0 to + * skip the computation. Else 0 < level < 1 + * @param[out] d_lower Lower limit of the prediction interval + * @param[out] d_upper Upper limit of the prediction interval */ -void batched_kalman_filter(cumlHandle& handle, const double* d_ys_b, int nobs, - const ARIMAParams& params, +void batched_kalman_filter(raft::handle_t& handle, const double* d_ys_b, + int nobs, const ARIMAParams& params, const ARIMAOrder& order, int batch_size, double* d_loglike, double* d_vs, int fc_steps = 0, - double* d_fc = nullptr); + double* d_fc = nullptr, double level = 0, + double* d_lower = nullptr, + double* d_upper = nullptr); /** * Convenience function for batched "jones transform" used in ARIMA to ensure @@ -59,7 +65,7 @@ void batched_kalman_filter(cumlHandle& handle, const double* d_ys_b, int nobs, * (expects pre-allocated array of size * (p+q)*batch_size) (host) */ -void batched_jones_transform(cumlHandle& handle, const ARIMAOrder& order, +void batched_jones_transform(raft::handle_t& handle, const ARIMAOrder& order, int batch_size, bool isInv, const double* h_params, double* h_Tparams); } // namespace ML diff --git a/cpp/include/cuml/tsa/holtwinters.h b/cpp/include/cuml/tsa/holtwinters.h index 2f6a5d08b8..bd3c4f0f28 100644 --- a/cpp/include/cuml/tsa/holtwinters.h +++ b/cpp/include/cuml/tsa/holtwinters.h @@ -75,11 +75,11 @@ void buffer_size(int n, int batch_size, int frequency, * @param[out] error_d * device pointer to array which will hold training SSE error */ -void fit(const ML::cumlHandle &handle, int n, int batch_size, int frequency, +void fit(const raft::handle_t &handle, int n, int batch_size, int frequency, int start_periods, ML::SeasonalType seasonal, float epsilon, float *data, float *level_d, float *trend_d, float *season_d, float *error_d); -void fit(const ML::cumlHandle &handle, int n, int batch_size, int frequency, +void fit(const raft::handle_t &handle, int n, int batch_size, int frequency, int start_periods, ML::SeasonalType seasonal, double epsilon, double *data, double *level_d, double *trend_d, double *season_d, double *error_d); @@ -107,10 +107,10 @@ void fit(const ML::cumlHandle &handle, int n, int batch_size, int frequency, * @param[out] forecast_d * device pointer to array which will hold the forecast points */ -void forecast(const ML::cumlHandle &handle, int n, int batch_size, +void forecast(const raft::handle_t &handle, int n, int batch_size, int frequency, int h, ML::SeasonalType seasonal, float 
*level_d, float *trend_d, float *season_d, float *forecast_d); -void forecast(const ML::cumlHandle &handle, int n, int batch_size, +void forecast(const raft::handle_t &handle, int n, int batch_size, int frequency, int h, ML::SeasonalType seasonal, double *level_d, double *trend_d, double *season_d, double *forecast_d); diff --git a/cpp/include/cuml/tsa/stationarity.h b/cpp/include/cuml/tsa/stationarity.h index 0b1240fbec..98dea34a6e 100644 --- a/cpp/include/cuml/tsa/stationarity.h +++ b/cpp/include/cuml/tsa/stationarity.h @@ -36,10 +36,10 @@ namespace Stationarity { * @param[in] pval_threshold P-value threshold above which a series is * considered stationary */ -void kpss_test(const cumlHandle& handle, const float* d_y, bool* results, +void kpss_test(const raft::handle_t& handle, const float* d_y, bool* results, int batch_size, int n_obs, int d, int D, int s, float pval_threshold); -void kpss_test(const cumlHandle& handle, const double* d_y, bool* results, +void kpss_test(const raft::handle_t& handle, const double* d_y, bool* results, int batch_size, int n_obs, int d, int D, int s, double pval_threshold); diff --git a/cpp/src/arima/batched_arima.cu b/cpp/src/arima/batched_arima.cu index 994d894151..3b7570fdf9 100644 --- a/cpp/src/arima/batched_arima.cu +++ b/cpp/src/arima/batched_arima.cu @@ -28,57 +28,71 @@ #include #include -#include +#include #include #include #include -#include #include -#include #include +#include +#include #include namespace ML { -void batched_diff(cumlHandle& handle, double* d_y_diff, const double* d_y, +void batched_diff(raft::handle_t& handle, double* d_y_diff, const double* d_y, int batch_size, int n_obs, const ARIMAOrder& order) { - const auto stream = handle.getStream(); + const auto stream = handle.get_stream(); MLCommon::TimeSeries::prepare_data(d_y_diff, d_y, batch_size, n_obs, order.d, order.D, order.s, stream); } -void predict(cumlHandle& handle, const double* d_y, int batch_size, int n_obs, - int start, int end, const ARIMAOrder& order, - const ARIMAParams& params, double* d_vs, double* d_y_p) { +void predict(raft::handle_t& handle, const double* d_y, int batch_size, + int n_obs, int start, int end, const ARIMAOrder& order, + const ARIMAParams& params, double* d_y_p, bool pre_diff, + double level, double* d_lower, double* d_upper) { ML::PUSH_RANGE(__func__); - auto allocator = handle.getDeviceAllocator(); - const auto stream = handle.getStream(); + auto allocator = handle.get_device_allocator(); + const auto stream = handle.get_stream(); + + bool diff = order.need_diff() && pre_diff && level == 0; // Prepare data - int diff_obs = order.lost_in_diff(); - int ld_yprep = n_obs - diff_obs; - double* d_y_prep = (double*)allocator->allocate( - ld_yprep * batch_size * sizeof(double), stream); - MLCommon::TimeSeries::prepare_data(d_y_prep, d_y, batch_size, n_obs, order.d, - order.D, order.s, stream); + int n_obs_kf; + const double* d_y_kf; + MLCommon::device_buffer diff_buffer(allocator, stream); + ARIMAOrder order_after_prep = order; + if (diff) { + n_obs_kf = n_obs - order.n_diff(); + diff_buffer.resize(n_obs_kf * batch_size, stream); + MLCommon::TimeSeries::prepare_data(diff_buffer.data(), d_y, batch_size, + n_obs, order.d, order.D, order.s, + stream); + d_y_kf = diff_buffer.data(); + order_after_prep.d = 0; + order_after_prep.D = 0; + } else { + n_obs_kf = n_obs; + d_y_kf = d_y; + } + + // Create temporary array for the residuals + MLCommon::device_buffer v_buffer(allocator, stream, + n_obs_kf * batch_size); + double* d_vs = v_buffer.data(); // Create 
temporary array for the forecasts int num_steps = std::max(end - n_obs, 0); - double* d_y_fc = nullptr; - if (num_steps) { - d_y_fc = (double*)allocator->allocate( - num_steps * batch_size * sizeof(double), stream); - } + MLCommon::device_buffer fc_buffer(allocator, stream, + num_steps * batch_size); + double* d_y_fc = fc_buffer.data(); - // Compute the residual and forecast - provide already prepared data and - // extracted parameters - ARIMAOrder order_after_prep = {order.p, 0, order.q, order.P, - 0, order.Q, order.s, order.k}; + // Compute the residual and forecast std::vector loglike = std::vector(batch_size); /// TODO: use device loglike to avoid useless copy ; part of #2233 - batched_loglike(handle, d_y_prep, batch_size, n_obs - diff_obs, - order_after_prep, params, loglike.data(), d_vs, false, true, - MLE, 0, num_steps, d_y_fc); + batched_loglike(handle, d_y_kf, batch_size, n_obs_kf, order_after_prep, + params, loglike.data(), d_vs, false, true, MLE, 0, num_steps, + d_y_fc, level, d_lower, d_upper); auto counting = thrust::make_counting_iterator(0); int predict_ld = end - start; @@ -87,7 +101,8 @@ void predict(cumlHandle& handle, const double* d_y, int batch_size, int n_obs, // In-sample prediction // - int p_start = std::max(start, diff_obs); + int res_offset = diff ? order.d + order.s * order.D : 0; + int p_start = std::max(start, res_offset); int p_end = std::min(n_obs, end); // The prediction loop starts by filling undefined predictions with NaN, @@ -96,13 +111,13 @@ void predict(cumlHandle& handle, const double* d_y, int batch_size, int n_obs, thrust::for_each(thrust::cuda::par.on(stream), counting, counting + batch_size, [=] __device__(int bid) { d_y_p[0] = 0.0; - for (int i = 0; i < diff_obs - start; i++) { + for (int i = 0; i < res_offset - start; i++) { d_y_p[bid * predict_ld + i] = nan(""); } for (int i = p_start; i < p_end; i++) { d_y_p[bid * predict_ld + i - start] = d_y[bid * n_obs + i] - - d_vs[bid * ld_yprep + i - diff_obs]; + d_vs[bid * n_obs_kf + i - res_offset]; } }); } @@ -112,10 +127,11 @@ void predict(cumlHandle& handle, const double* d_y, int batch_size, int n_obs, // if (num_steps) { - // Add trend and/or undiff - MLCommon::TimeSeries::finalize_forecast(d_y_fc, d_y, num_steps, batch_size, - n_obs, n_obs, order.d, order.D, - order.s, stream); + if (diff) { + MLCommon::TimeSeries::finalize_forecast(d_y_fc, d_y, num_steps, + batch_size, n_obs, n_obs, order.d, + order.D, order.s, stream); + } // Copy forecast in d_y_p thrust::for_each(thrust::cuda::par.on(stream), counting, @@ -125,13 +141,9 @@ void predict(cumlHandle& handle, const double* d_y, int batch_size, int n_obs, d_y_fc[num_steps * bid + i]; } }); - - allocator->deallocate(d_y_fc, num_steps * batch_size * sizeof(double), - stream); + /// TODO: 2D copy kernel? } - allocator->deallocate(d_y_prep, ld_yprep * batch_size * sizeof(double), - stream); ML::POP_RANGE(); } @@ -199,7 +211,7 @@ __global__ void sum_of_squares_kernel(const DataT* d_y, const DataT* d_mu, threadIdx.x < n_phi ? phi * b_y[i - threadIdx.x - 1 - start_y] : (DataT)0; res -= threadIdx.x < n_theta ? 
theta * b_vs[i - threadIdx.x - 1 - start_v] : (DataT)0; - res = MLCommon::blockReduce(res, temp_smem); + res = raft::blockReduce(res, temp_smem); if (threadIdx.x == 0) { res += b_y[i - start_y] - mu; b_vs[i - start_v] = res; @@ -211,7 +223,7 @@ __global__ void sum_of_squares_kernel(const DataT* d_y, const DataT* d_mu, if (threadIdx.x == 0) { d_loglike[blockIdx.x] = -0.5 * static_cast(n_obs) * - MLCommon::myLog(ssq / static_cast(n_obs - start_sum)); + raft::myLog(ssq / static_cast(n_obs - start_sum)); } } @@ -227,13 +239,13 @@ __global__ void sum_of_squares_kernel(const DataT* d_y, const DataT* d_mu, * @param[out] d_loglike Evaluated log-likelihood (device) * @param[in] truncate Number of observations to skip in the sum */ -void conditional_sum_of_squares(cumlHandle& handle, const double* d_y, +void conditional_sum_of_squares(raft::handle_t& handle, const double* d_y, int batch_size, int n_obs, const ARIMAOrder& order, const ARIMAParams& Tparams, double* d_loglike, int truncate) { ML::PUSH_RANGE(__func__); - auto stream = handle.getStream(); + auto stream = handle.get_stream(); int n_phi = order.n_phi(); int n_theta = order.n_theta(); @@ -243,7 +255,7 @@ void conditional_sum_of_squares(cumlHandle& handle, const double* d_y, int start_v = start_sum - n_theta; // Compute the sum-of-squares and the log-likelihood - int n_warps = std::max(MLCommon::ceildiv(max_lags, 32), 1); + int n_warps = std::max(raft::ceildiv(max_lags, 32), 1); size_t shared_mem_size = (2 * n_obs - start_y - start_v + n_warps) * sizeof(double); sum_of_squares_kernel<<>>( @@ -255,22 +267,21 @@ void conditional_sum_of_squares(cumlHandle& handle, const double* d_y, ML::POP_RANGE(); } -void batched_loglike(cumlHandle& handle, const double* d_y, int batch_size, +void batched_loglike(raft::handle_t& handle, const double* d_y, int batch_size, int n_obs, const ARIMAOrder& order, const ARIMAParams& params, double* loglike, double* d_vs, bool trans, bool host_loglike, LoglikeMethod method, int truncate, int fc_steps, - double* d_fc) { + double* d_fc, double level, double* d_lower, + double* d_upper) { ML::PUSH_RANGE(__func__); - auto allocator = handle.getDeviceAllocator(); - auto stream = handle.getStream(); + auto allocator = handle.get_device_allocator(); + auto stream = handle.get_stream(); ARIMAParams Tparams; - if (method != MLE && fc_steps) { - /// TODO: add warning when solving #2232 - method = MLE; - } + ASSERT(method == MLE || fc_steps == 0, + "Only MLE method is valid for forecasting"); /* Create log-likelihood device array if host pointer is provided */ double* d_loglike; @@ -294,36 +305,18 @@ void batched_loglike(cumlHandle& handle, const double* d_y, int batch_size, Tparams = params; } - if (!order.need_prep()) { - if (method == CSS) { - conditional_sum_of_squares(handle, d_y, batch_size, n_obs, order, Tparams, - d_loglike, truncate); - } else { - batched_kalman_filter(handle, d_y, n_obs, Tparams, order, batch_size, - d_loglike, d_vs, fc_steps, d_fc); - } + if (method == CSS) { + conditional_sum_of_squares(handle, d_y, batch_size, n_obs, order, Tparams, + d_loglike, truncate); } else { - MLCommon::device_buffer y_prep( - allocator, stream, batch_size * (n_obs - order.lost_in_diff())); - double* d_y_prep = y_prep.data(); - - MLCommon::TimeSeries::prepare_data(d_y_prep, d_y, batch_size, n_obs, - order.d, order.D, order.s, stream); - - if (method == CSS) { - conditional_sum_of_squares(handle, d_y_prep, batch_size, - n_obs - order.lost_in_diff(), order, Tparams, - d_loglike, truncate); - } else { - 
batched_kalman_filter(handle, d_y_prep, n_obs - order.lost_in_diff(), - Tparams, order, batch_size, d_loglike, d_vs, - fc_steps, d_fc); - } + batched_kalman_filter(handle, d_y, n_obs, Tparams, order, batch_size, + d_loglike, d_vs, fc_steps, d_fc, level, d_lower, + d_upper); } if (host_loglike) { /* Transfer log-likelihood device -> host */ - MLCommon::updateHost(loglike, d_loglike, batch_size, stream); + raft::update_host(loglike, d_loglike, batch_size, stream); } if (trans) { @@ -332,49 +325,55 @@ ML::POP_RANGE(); } -void batched_loglike(cumlHandle& handle, const double* d_y, int batch_size, +void batched_loglike(raft::handle_t& handle, const double* d_y, int batch_size, int n_obs, const ARIMAOrder& order, const double* d_params, double* loglike, double* d_vs, bool trans, bool host_loglike, LoglikeMethod method, int truncate, - int fc_steps, double* d_fc) { + int fc_steps, double* d_fc, double level, double* d_lower, + double* d_upper) { ML::PUSH_RANGE(__func__); // unpack parameters - auto allocator = handle.getDeviceAllocator(); - auto stream = handle.getStream(); + auto allocator = handle.get_device_allocator(); + auto stream = handle.get_stream(); ARIMAParams<double> params; params.allocate(order, batch_size, allocator, stream, false); params.unpack(order, batch_size, d_params, stream); batched_loglike(handle, d_y, batch_size, n_obs, order, params, loglike, d_vs, - trans, host_loglike, method, truncate, fc_steps, d_fc); + trans, host_loglike, method, truncate, fc_steps, d_fc, level, + d_lower, d_upper); params.deallocate(order, batch_size, allocator, stream, false); + ML::POP_RANGE(); } -void batched_loglike_grad(cumlHandle& handle, const double* d_y, int batch_size, - int n_obs, const ARIMAOrder& order, const double* d_x, - double* d_grad, double h, bool trans, - LoglikeMethod method, int truncate) { +void batched_loglike_grad(raft::handle_t& handle, const double* d_y, + int batch_size, int n_obs, const ARIMAOrder& order, + const double* d_x, double* d_grad, double h, + bool trans, LoglikeMethod method, int truncate) { ML::PUSH_RANGE(__func__); - auto allocator = handle.getDeviceAllocator(); - auto stream = handle.getStream(); + auto allocator = handle.get_device_allocator(); + auto stream = handle.get_stream(); auto counting = thrust::make_counting_iterator(0); int N = order.complexity(); // Initialize the perturbed x vector MLCommon::device_buffer<double> x_pert(allocator, stream, N * batch_size); double* d_x_pert = x_pert.data(); - MLCommon::copy(d_x_pert, d_x, N * batch_size, stream); + raft::copy(d_x_pert, d_x, N * batch_size, stream); // Create buffers for the log-likelihood and residuals - MLCommon::device_buffer<double> ll_pos(allocator, stream, batch_size); - MLCommon::device_buffer<double> ll_neg(allocator, stream, batch_size); - MLCommon::device_buffer<double> res( - allocator, stream, (n_obs - order.lost_in_diff()) * batch_size); - double* d_ll_pos = ll_pos.data(); - double* d_ll_neg = ll_neg.data(); + MLCommon::device_buffer<double> ll_base(allocator, stream, batch_size); + MLCommon::device_buffer<double> ll_pert(allocator, stream, batch_size); + MLCommon::device_buffer<double> res(allocator, stream, n_obs * batch_size); + double* d_ll_base = ll_base.data(); + double* d_ll_pert = ll_pert.data(); + + // Evaluate the log-likelihood with the given parameter vector + batched_loglike(handle, d_y, batch_size, n_obs, order, d_x, d_ll_base, + res.data(), trans, false, method, truncate); for (int i = 0; i < N; i++) { // Add the perturbation to the i-th parameter @@
-384,24 +383,14 @@ void batched_loglike_grad(cumlHandle& handle, const double* d_y, int batch_size, }); // Evaluate the log-likelihood with the positive perturbation - batched_loglike(handle, d_y, batch_size, n_obs, order, d_x_pert, d_ll_pos, - res.data(), trans, false, method, truncate); - - // Subtract the perturbation to the i-th parameter - thrust::for_each(thrust::cuda::par.on(stream), counting, - counting + batch_size, [=] __device__(int bid) { - d_x_pert[N * bid + i] = d_x[N * bid + i] - h; - }); - - // Evaluate the log-likelihood with the negative perturbation - batched_loglike(handle, d_y, batch_size, n_obs, order, d_x_pert, d_ll_neg, + batched_loglike(handle, d_y, batch_size, n_obs, order, d_x_pert, d_ll_pert, res.data(), trans, false, method, truncate); - // First derivative with a second-order accuracy + // First derivative with a first-order accuracy thrust::for_each(thrust::cuda::par.on(stream), counting, counting + batch_size, [=] __device__(int bid) { d_grad[N * bid + i] = - (d_ll_pos[bid] - d_ll_neg[bid]) / (2.0 * h); + (d_ll_pert[bid] - d_ll_base[bid]) / h; }); // Reset the i-th parameter @@ -413,27 +402,26 @@ void batched_loglike_grad(cumlHandle& handle, const double* d_y, int batch_size, ML::POP_RANGE(); } -void information_criterion(cumlHandle& handle, const double* d_y, +void information_criterion(raft::handle_t& handle, const double* d_y, int batch_size, int n_obs, const ARIMAOrder& order, const ARIMAParams& params, double* d_ic, int ic_type) { ML::PUSH_RANGE(__func__); - auto allocator = handle.getDeviceAllocator(); - auto stream = handle.getStream(); - double* d_vs = (double*)allocator->allocate( - sizeof(double) * (n_obs - order.lost_in_diff()) * batch_size, stream); + auto allocator = handle.get_device_allocator(); + auto stream = handle.get_stream(); + + MLCommon::device_buffer v_buffer(allocator, stream, + n_obs * batch_size); /* Compute log-likelihood in d_ic */ - batched_loglike(handle, d_y, batch_size, n_obs, order, params, d_ic, d_vs, - false, false); + batched_loglike(handle, d_y, batch_size, n_obs, order, params, d_ic, + v_buffer.data(), false, false, MLE); /* Compute information criterion from log-likelihood and base term */ MLCommon::Metrics::Batched::information_criterion( d_ic, d_ic, static_cast(ic_type), - order.complexity(), batch_size, n_obs - order.lost_in_diff(), stream); + order.complexity(), batch_size, n_obs - order.n_diff(), stream); - allocator->deallocate( - d_vs, sizeof(double) * (n_obs - order.lost_in_diff()) * batch_size, stream); ML::POP_RANGE(); } @@ -481,15 +469,15 @@ DI bool test_invparams(const double* params, int pq) { * ARMA model (with or without seasonality) * @note: in this function the non-seasonal case has s=1, not s=0! 
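Stepping back to the batched_loglike_grad change above: the former central difference (two log-likelihood evaluations per parameter) is replaced by a one-sided forward difference that reuses a single base evaluation, i.e. grad_i ≈ (ll(x + h·e_i) − ll(x)) / h. A minimal scalar sketch of the scheme; the quadratic stand-in for the log-likelihood is illustrative only:

```cpp
#include <cstdio>
#include <vector>

// Hedged sketch of the forward-difference scheme now used by
// batched_loglike_grad: evaluate ll(x) once, then ll(x + h*e_i) once per
// parameter. The lambda is a stand-in objective, not the ARIMA likelihood.
int main() {
  auto ll = [](const std::vector<double>& x) {
    return -(x[0] - 1.0) * (x[0] - 1.0) - 2.0 * (x[1] + 0.5) * (x[1] + 0.5);
  };
  std::vector<double> x = {0.3, 0.2};
  const double h = 1e-6;
  const double base = ll(x);  // single base evaluation, reused for all i
  for (std::size_t i = 0; i < x.size(); i++) {
    std::vector<double> xp = x;
    xp[i] += h;  // perturb only the i-th parameter
    std::printf("grad[%zu] ~= %.4f\n", i, (ll(xp) - base) / h);
  }
  return 0;
}
```

This halves the number of log-likelihood evaluations per optimizer step, at the cost of dropping from O(h^2) to O(h) truncation error, which is exactly what the updated "first-order accuracy" comment records.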
*/ -void _arma_least_squares(cumlHandle& handle, double* d_ar, double* d_ma, +void _arma_least_squares(raft::handle_t& handle, double* d_ar, double* d_ma, double* d_sigma2, const MLCommon::LinAlg::Batched::Matrix& bm_y, int p, int q, int s, bool estimate_sigma2, int k = 0, double* d_mu = nullptr) { - const auto& handle_impl = handle.getImpl(); - auto stream = handle_impl.getStream(); - auto cublas_handle = handle_impl.getCublasHandle(); - auto allocator = handle_impl.getDeviceAllocator(); + const auto& handle_impl = handle; + auto stream = handle_impl.get_stream(); + auto cublas_handle = handle_impl.get_cublas_handle(); + auto allocator = handle_impl.get_device_allocator(); auto counting = thrust::make_counting_iterator(0); int batch_size = bm_y.batches(); @@ -581,8 +569,8 @@ void _arma_least_squares(cumlHandle& handle, double* d_ar, double* d_ma, MLCommon::LinAlg::Batched::Matrix bm_final_residual( n_obs - r, 1, batch_size, cublas_handle, allocator, stream, false); if (estimate_sigma2) { - MLCommon::copy(bm_final_residual.raw_data(), bm_arma_fit.raw_data(), - (n_obs - r) * batch_size, stream); + raft::copy(bm_final_residual.raw_data(), bm_arma_fit.raw_data(), + (n_obs - r) * batch_size, stream); } // ARMA fit @@ -655,7 +643,7 @@ void _arma_least_squares(cumlHandle& handle, double* d_ar, double* d_ma, * Auxiliary function of estimate_x0: compute the starting parameters for * the series pre-processed by estimate_x0 */ -void _start_params(cumlHandle& handle, ARIMAParams& params, +void _start_params(raft::handle_t& handle, ARIMAParams& params, const MLCommon::LinAlg::Batched::Matrix& bm_y, const ARIMAOrder& order) { // Estimate an ARMA fit without seasonality @@ -670,14 +658,14 @@ void _start_params(cumlHandle& handle, ARIMAParams& params, order.p + order.q + order.k == 0); } -void estimate_x0(cumlHandle& handle, ARIMAParams& params, +void estimate_x0(raft::handle_t& handle, ARIMAParams& params, const double* d_y, int batch_size, int n_obs, const ARIMAOrder& order) { ML::PUSH_RANGE(__func__); - const auto& handle_impl = handle.getImpl(); - auto stream = handle_impl.getStream(); - auto cublas_handle = handle_impl.getCublasHandle(); - auto allocator = handle_impl.getDeviceAllocator(); + const auto& handle_impl = handle; + auto stream = handle_impl.get_stream(); + auto cublas_handle = handle_impl.get_cublas_handle(); + auto allocator = handle_impl.get_device_allocator(); // Difference if necessary, copy otherwise MLCommon::LinAlg::Batched::Matrix bm_yd( diff --git a/cpp/src/arima/batched_kalman.cu b/cpp/src/arima/batched_kalman.cu index 7c1304c716..932f50d07c 100644 --- a/cpp/src/arima/batched_kalman.cu +++ b/cpp/src/arima/batched_kalman.cu @@ -24,42 +24,55 @@ #include #include -#include -#include +#include +#include #include +#include #include -#include #include -#include +#include +#include #include #include namespace ML { //! Thread-local Matrix-Vector multiplication. 
-template -__device__ void Mv_l(const double* A, const double* v, double* out) { - for (int i = 0; i < r; i++) { +template +DI void Mv_l(const double* A, const double* v, double* out) { + for (int i = 0; i < n; i++) { double sum = 0.0; - for (int j = 0; j < r; j++) { - sum += A[i + j * r] * v[j]; + for (int j = 0; j < n; j++) { + sum += A[i + j * n] * v[j]; } out[i] = sum; } } +template +DI void Mv_l(double alpha, const double* A, const double* v, double beta, + double* out) { + for (int i = 0; i < n; i++) { + double sum = 0.0; + for (int j = 0; j < n; j++) { + sum += A[i + j * n] * v[j]; + } + out[i] = alpha * sum + beta * out[i]; + } +} + //! Thread-local Matrix-Matrix multiplication. -template -__device__ void MM_l(const double* A, const double* B, double* out) { - for (int i = 0; i < r; i++) { - for (int j = 0; j < r; j++) { +template +DI void MM_l(const double* A, const double* B, double* out) { + for (int i = 0; i < n; i++) { + for (int j = 0; j < n; j++) { double sum = 0.0; - for (int k = 0; k < r; k++) { - double Aik = aT ? A[k + i * r] : A[i + k * r]; - double Bkj = bT ? B[j + k * r] : B[k + j * r]; + for (int k = 0; k < n; k++) { + double Aik = aT ? A[k + i * n] : A[i + k * n]; + double Bkj = bT ? B[j + k * n] : B[k + j * n]; sum += Aik * Bkj; } - out[i + j * r] = sum; + out[i + j * n] = sum; } } } @@ -73,7 +86,7 @@ __device__ void MM_l(const double* A, const double* B, double* out) { * @param[in] nobs Number of observation per series * @param[in] T Batched transition matrix. (r x r) * @param[in] Z Batched "design" vector (1 x r) - * @param[in] RRT Batched R*R.T (R="selection" vector) (r x r) + * @param[in] RQR Batched R*Q*R' (r x r) * @param[in] P Batched P (r x r) * @param[in] alpha Batched state vector (r x 1) * @param[in] intercept Do we fit an intercept? @@ -82,40 +95,44 @@ __device__ void MM_l(const double* A, const double* B, double* out) { * @param[out] vs Batched residuals (nobs) * @param[out] Fs Batched variance of prediction errors (nobs) * @param[out] sum_logFs Batched sum of the logs of Fs (1) + * @param[in] n_diff d + s*D * @param[in] fc_steps Number of steps to forecast - * @param[in] d_fc Array to store the forecast + * @param[out] d_fc Array to store the forecast + * @param[in] conf_int Whether to compute confidence intervals + * @param[in] d_F_fc Batched variance of forecast errors (fc_steps) */ -template +template __global__ void batched_kalman_loop_kernel( const double* ys, int nobs, const double* T, const double* Z, - const double* RRT, const double* P, const double* alpha, bool intercept, + const double* RQR, const double* P, const double* alpha, bool intercept, const double* d_mu, int batch_size, double* vs, double* Fs, double* sum_logFs, - int fc_steps = 0, double* d_fc = nullptr) { - constexpr int r2 = r * r; - double l_RRT[r2]; - double l_T[r2]; - // double l_Z[r]; // note: will be used when introducing exogeneous var. 
- double l_P[r2]; - double l_alpha[r]; - double l_K[r]; - double l_tmp[r2]; - double l_TP[r2]; + int n_diff, int fc_steps = 0, double* d_fc = nullptr, bool conf_int = false, + double* d_F_fc = nullptr) { + constexpr int rd2 = rd * rd; + double l_RQR[rd2]; + double l_T[rd2]; + double l_Z[rd]; + double l_P[rd2]; + double l_alpha[rd]; + double l_K[rd]; + double l_tmp[rd2]; + double l_TP[rd2]; int bid = blockDim.x * blockIdx.x + threadIdx.x; if (bid < batch_size) { // Load global mem into registers { - int b_r_offset = bid * r; - int b_r2_offset = bid * r2; - for (int i = 0; i < r2; i++) { - l_RRT[i] = RRT[b_r2_offset + i]; - l_T[i] = T[b_r2_offset + i]; - l_P[i] = P[b_r2_offset + i]; + int b_rd_offset = bid * rd; + int b_rd2_offset = bid * rd2; + for (int i = 0; i < rd2; i++) { + l_RQR[i] = RQR[b_rd2_offset + i]; + l_T[i] = T[b_rd2_offset + i]; + l_P[i] = P[b_rd2_offset + i]; } - for (int i = 0; i < r; i++) { - // l_Z[i] = Z[b_r_offset + i]; - l_alpha[i] = alpha[b_r_offset + i]; + for (int i = 0; i < rd; i++) { + if (n_diff > 0) l_Z[i] = Z[b_rd_offset + i]; + l_alpha[i] = alpha[b_rd_offset + i]; } } @@ -127,65 +144,128 @@ __global__ void batched_kalman_loop_kernel( double mu = intercept ? d_mu[bid] : 0.0; for (int it = 0; it < nobs; it++) { - // 1. & 2. - double vs_it; - double _Fs = l_P[0]; - vs_it = b_ys[it] - l_alpha[0]; + // 1. v = y - Z*alpha + double vs_it = b_ys[it]; + if (n_diff == 0) + vs_it -= l_alpha[0]; + else { + for (int i = 0; i < rd; i++) { + vs_it -= l_alpha[i] * l_Z[i]; + } + } b_vs[it] = vs_it; + + // 2. F = Z*P*Z' + double _Fs; + if (n_diff == 0) + _Fs = l_P[0]; + else { + _Fs = 0.0; + for (int i = 0; i < rd; i++) { + for (int j = 0; j < rd; j++) { + _Fs += l_P[j * rd + i] * l_Z[i] * l_Z[j]; + } + } + } b_Fs[it] = _Fs; - b_sum_logFs += log(_Fs); + if (it >= n_diff) b_sum_logFs += log(_Fs); // 3. K = 1/Fs[it] * T*P*Z' // TP = T*P - MM_l(l_T, l_P, l_TP); - // K = 1/Fs[it] * TP*Z' ; optimized for Z = (1 0 ... 0) + MM_l(l_T, l_P, l_TP); + // K = 1/Fs[it] * TP*Z' double _1_Fs = 1.0 / _Fs; - for (int i = 0; i < r; i++) { - l_K[i] = _1_Fs * l_TP[i]; - } + if (n_diff == 0) { + for (int i = 0; i < rd; i++) { + l_K[i] = _1_Fs * l_TP[i]; + } + } else + Mv_l(_1_Fs, l_TP, l_Z, 0.0, l_K); // 4. alpha = T*alpha + K*vs[it] + c // tmp = T*alpha - Mv_l(l_T, l_alpha, l_tmp); + Mv_l(l_T, l_alpha, l_tmp); // alpha = tmp + K*vs[it] - for (int i = 0; i < r; i++) { + for (int i = 0; i < rd; i++) { l_alpha[i] = l_tmp[i] + l_K[i] * vs_it; } - // alpha_0 = alpha_0 + mu - l_alpha[0] += mu; + // alpha = alpha + c + l_alpha[n_diff] += mu; // 5. L = T - K * Z // L = T (L is tmp) - for (int i = 0; i < r2; i++) { + for (int i = 0; i < rd2; i++) { l_tmp[i] = l_T[i]; } - // L = L - K * Z ; optimized for Z = (1 0 ... 0): - // substract K to the first column of L - for (int i = 0; i < r; i++) { - l_tmp[i] -= l_K[i]; + // L = L - K * Z + if (n_diff == 0) { + for (int i = 0; i < rd; i++) { + l_tmp[i] -= l_K[i]; + } + } else { + for (int i = 0; i < rd; i++) { + for (int j = 0; j < rd; j++) { + l_tmp[j * rd + i] -= l_K[i] * l_Z[j]; + } + } } - // 6. P = T*P*L' + R*R' + // 6. P = T*P*L' + R*Q*R' // P = TP*L' - MM_l(l_TP, l_tmp, l_P); - // P = P + RRT - for (int i = 0; i < r2; i++) { - l_P[i] += l_RRT[i]; + MM_l(l_TP, l_tmp, l_P); + // P = P + RQR + for (int i = 0; i < rd2; i++) { + l_P[i] += l_RQR[i]; } } sum_logFs[bid] = b_sum_logFs; // Forecast - double* b_fc = fc_steps ? 
d_fc + bid * fc_steps : nullptr; - for (int i = 0; i < fc_steps; i++) { - b_fc[i] = l_alpha[0]; - - // alpha = T*alpha + c - Mv_l(l_T, l_alpha, l_tmp); - for (int i = 0; i < r; i++) { - l_alpha[i] = l_tmp[i]; + { + double* b_fc = fc_steps ? d_fc + bid * fc_steps : nullptr; + double* b_F_fc = conf_int ? d_F_fc + bid * fc_steps : nullptr; + for (int it = 0; it < fc_steps; it++) { + if (n_diff == 0) + b_fc[it] = l_alpha[0]; + else { + double pred = 0.0; + for (int i = 0; i < rd; i++) { + pred += l_alpha[i] * l_Z[i]; + } + b_fc[it] = pred; + } + + // alpha = T*alpha + c + Mv_l(l_T, l_alpha, l_tmp); + for (int i = 0; i < rd; i++) { + l_alpha[i] = l_tmp[i]; + } + l_alpha[n_diff] += mu; + + if (conf_int) { + if (n_diff == 0) + b_F_fc[it] = l_P[0]; + else { + double _Fs = 0.0; + for (int i = 0; i < rd; i++) { + for (int j = 0; j < rd; j++) { + _Fs += l_P[j * rd + i] * l_Z[i] * l_Z[j]; + } + } + b_F_fc[it] = _Fs; + } + + // P = T*P*T' + RR' + // TP = T*P + MM_l(l_T, l_P, l_TP); + // P = TP*T' + MM_l(l_TP, l_T, l_P); + // P = P + RR' + for (int i = 0; i < rd2; i++) { + l_P[i] += l_RQR[i]; + } + } } - l_alpha[0] += mu; } } } @@ -198,7 +278,7 @@ __global__ void batched_kalman_loop_kernel( * @param[in] T Batched transition matrix. (r x r) * @param[in] T_sparse Batched sparse matrix T (r x r) * @param[in] Z Batched "design" vector (1 x r) - * @param[in] RRT Batched R*R' (R="selection" vector) (r x r) + * @param[in] RQR Batched R*Q*R' (r x r) * @param[in] P Batched P (r x r) * @param[in] alpha Batched state vector (r x 1) * @param[in] intercept Do we fit an intercept? @@ -207,37 +287,42 @@ __global__ void batched_kalman_loop_kernel( * @param[out] d_vs Batched residuals (nobs) * @param[out] d_Fs Batched variance of prediction errors (nobs) * @param[out] d_sum_logFs Batched sum of the logs of Fs (1) + * @param[in] n_diff d + s*D * @param[in] fc_steps Number of steps to forecast - * @param[in] d_fc Array to store the forecast + * @param[out] d_fc Array to store the forecast + * @param[in] conf_int Whether to compute confidence intervals + * @param[out] d_F_fc Batched variance of forecast errors (fc_steps) */ void _batched_kalman_loop_large( const double* d_ys, int nobs, const MLCommon::LinAlg::Batched::Matrix& T, const MLCommon::Sparse::Batched::CSR& T_sparse, const MLCommon::LinAlg::Batched::Matrix& Z, - const MLCommon::LinAlg::Batched::Matrix& RRT, + const MLCommon::LinAlg::Batched::Matrix& RQR, MLCommon::LinAlg::Batched::Matrix& P, MLCommon::LinAlg::Batched::Matrix& alpha, bool intercept, - const double* d_mu, int r, double* d_vs, double* d_Fs, double* d_sum_logFs, - int fc_steps = 0, double* d_fc = nullptr) { + const double* d_mu, int rd, double* d_vs, double* d_Fs, double* d_sum_logFs, + int n_diff, int fc_steps = 0, double* d_fc = nullptr, bool conf_int = false, + double* d_F_fc = nullptr) { auto stream = T.stream(); auto allocator = T.allocator(); auto cublasHandle = T.cublasHandle(); int nb = T.batches(); - int r2 = r * r; + int rd2 = rd * rd; auto counting = thrust::make_counting_iterator(0); // Temporary matrices and vectors - MLCommon::LinAlg::Batched::Matrix v_tmp(r, 1, nb, cublasHandle, + MLCommon::LinAlg::Batched::Matrix v_tmp(rd, 1, nb, cublasHandle, allocator, stream, false); - MLCommon::LinAlg::Batched::Matrix m_tmp(r, r, nb, cublasHandle, + MLCommon::LinAlg::Batched::Matrix m_tmp(rd, rd, nb, cublasHandle, allocator, stream, false); - MLCommon::LinAlg::Batched::Matrix K(r, 1, nb, cublasHandle, allocator, - stream, false); - MLCommon::LinAlg::Batched::Matrix TP(r, r, nb, cublasHandle, + 
MLCommon::LinAlg::Batched::Matrix K(rd, 1, nb, cublasHandle, + allocator, stream, false); + MLCommon::LinAlg::Batched::Matrix TP(rd, rd, nb, cublasHandle, allocator, stream, false); // Shortcuts + const double* d_Z = Z.raw_data(); double* d_P = P.raw_data(); double* d_alpha = alpha.raw_data(); double* d_K = K.raw_data(); @@ -251,157 +336,263 @@ void _batched_kalman_loop_large( // 1. & 2. thrust::for_each(thrust::cuda::par.on(stream), counting, counting + nb, [=] __device__(int bid) { - d_vs[bid * nobs + it] = - d_ys[bid * nobs + it] - d_alpha[bid * r]; - double l_P = d_P[bid * r2]; - d_Fs[bid * nobs + it] = l_P; - d_sum_logFs[bid] += log(l_P); + const double* b_P = d_P + bid * rd2; + const double* b_Z = d_Z + bid * rd; + const double* b_alpha = d_alpha + bid * rd; + + double vt = d_ys[bid * nobs + it]; + if (n_diff == 0) { + vt -= b_alpha[0]; + } else { + for (int i = 0; i < rd; i++) { + vt -= b_alpha[i] * b_Z[i]; + } + } + d_vs[bid * nobs + it] = vt; + + double _F; + if (n_diff == 0) + _F = b_P[0]; + else { + _F = 0.0; + for (int i = 0; i < rd; i++) { + for (int j = 0; j < rd; j++) { + _F += b_P[j * rd + i] * b_Z[i] * b_Z[j]; + } + } + } + d_Fs[bid * nobs + it] = _F; + if (it >= n_diff) d_sum_logFs[bid] += log(_F); }); // 3. K = 1/Fs[it] * T*P*Z' // TP = T*P (also used later) - if (r <= 32) + if (rd <= 32) MLCommon::Sparse::Batched::b_spmm(1.0, T_sparse, P, 0.0, TP); else - MLCommon::LinAlg::Batched::b_gemm(false, false, r, r, r, 1.0, T, P, 0.0, - TP); - // K = 1/Fs[it] * TP*Z' ; optimized for Z = (1 0 ... 0) + MLCommon::LinAlg::Batched::b_gemm(false, false, rd, rd, rd, 1.0, T, P, + 0.0, TP); + // K = 1/Fs[it] * TP*Z' thrust::for_each(thrust::cuda::par.on(stream), counting, counting + nb, [=] __device__(int bid) { + const double* b_TP = d_TP + bid * rd2; + double* b_K = d_K + bid * rd; + double _1_Fs = 1.0 / d_Fs[bid * nobs + it]; - for (int i = 0; i < r; i++) { - d_K[bid * r + i] = _1_Fs * d_TP[bid * r2 + i]; + if (n_diff == 0) { + for (int i = 0; i < rd; i++) { + b_K[i] = _1_Fs * b_TP[i]; + } + } else { + const double* b_Z = d_Z + bid * rd; + for (int i = 0; i < rd; i++) { + double acc = 0.0; + for (int j = 0; j < rd; j++) { + acc += b_TP[rd * j + i] * b_Z[j]; + } + b_K[i] = _1_Fs * acc; + } } }); // 4. alpha = T*alpha + K*vs[it] + c // v_tmp = T*alpha MLCommon::Sparse::Batched::b_spmv(1.0, T_sparse, alpha, 0.0, v_tmp); - // alpha = v_tmp + K*vs[it] - // alpha_0 = alpha_0 + mu + // alpha = v_tmp + K*vs[it] + c thrust::for_each(thrust::cuda::par.on(stream), counting, counting + nb, [=] __device__(int bid) { + const double* b_Talpha = d_v_tmp + bid * rd; + const double* b_K = d_K + bid * rd; + double* b_alpha = d_alpha + bid * rd; + double _vs = d_vs[bid * nobs + it]; - for (int i = 0; i < r; i++) { - double mu = (intercept && i == 0) ? d_mu[bid] : 0.0; - d_alpha[bid * r + i] = - d_v_tmp[bid * r + i] + _vs * d_K[bid * r + i] + mu; + for (int i = 0; i < rd; i++) { + double mu = + (intercept && i == n_diff) ? d_mu[bid] : 0.0; + b_alpha[i] = b_Talpha[i] + b_K[i] * _vs + mu; } }); // 5. L = T - K * Z // L = T (L is m_tmp) - MLCommon::copy(m_tmp.raw_data(), T.raw_data(), nb * r2, stream); - // L = L - K * Z ; optimized for Z = (1 0 ... 
0): - // substract K to the first column of L + raft::copy(m_tmp.raw_data(), T.raw_data(), nb * rd2, stream); + // L = L - K * Z thrust::for_each(thrust::cuda::par.on(stream), counting, counting + nb, [=] __device__(int bid) { - for (int i = 0; i < r; i++) { - d_m_tmp[bid * r2 + i] -= d_K[bid * r + i]; + const double* b_K = d_K + bid * rd; + double* b_L = d_m_tmp + bid * rd2; + + if (n_diff == 0) { + for (int i = 0; i < rd; i++) { + b_L[i] -= b_K[i]; + } + } else { + const double* b_Z = d_Z + bid * rd; + for (int i = 0; i < rd; i++) { + for (int j = 0; j < rd; j++) { + b_L[j * rd + i] -= b_K[i] * b_Z[j]; + } + } } }); - // MLCommon::LinAlg::Batched::b_gemm(false, false, r, r, 1, -1.0, K, Z, 1.0, + // MLCommon::LinAlg::Batched::b_gemm(false, false, rd, rd, 1, -1.0, K, Z, 1.0, // m_tmp); // generic - // 6. P = T*P*L' + R*R' + // 6. P = T*P*L' + R*Q*R' // P = TP*L' - MLCommon::LinAlg::Batched::b_gemm(false, true, r, r, r, 1.0, TP, m_tmp, 0.0, - P); - // P = P + R*R' - MLCommon::LinAlg::binaryOp( - d_P, d_P, RRT.raw_data(), r2 * nb, + MLCommon::LinAlg::Batched::b_gemm(false, true, rd, rd, rd, 1.0, TP, m_tmp, + 0.0, P); + // P = P + R*Q*R' + raft::linalg::binaryOp( + d_P, d_P, RQR.raw_data(), rd2 * nb, [=] __device__(double a, double b) { return a + b; }, stream); } // Forecast - for (int i = 0; i < fc_steps; i++) { - thrust::for_each( - thrust::cuda::par.on(stream), counting, counting + nb, - [=] __device__(int bid) { d_fc[bid * fc_steps + i] = d_alpha[bid * r]; }); + for (int it = 0; it < fc_steps; it++) { + thrust::for_each(thrust::cuda::par.on(stream), counting, counting + nb, + [=] __device__(int bid) { + const double* b_alpha = d_alpha + bid * rd; - MLCommon::Sparse::Batched::b_spmv(1.0, T_sparse, alpha, 0.0, v_tmp); - MLCommon::copy(d_alpha, v_tmp.raw_data(), r * nb, stream); + double pred; + if (n_diff == 0) { + pred = b_alpha[0]; + } else { + const double* b_Z = d_Z + bid * rd; + pred = 0.0; + for (int i = 0; i < rd; i++) { + pred += b_alpha[i] * b_Z[i]; + } + } + d_fc[bid * fc_steps + it] = pred; + }); + + // alpha = T*alpha + c + // alpha = T*alpha + MLCommon::Sparse::Batched::b_spmv(1.0, T_sparse, alpha, 0.0, v_tmp); + raft::copy(d_alpha, v_tmp.raw_data(), rd * nb, stream); + // alpha += c if (intercept) { thrust::for_each( thrust::cuda::par.on(stream), counting, counting + nb, - [=] __device__(int bid) { d_alpha[bid * r] += d_mu[bid]; }); + [=] __device__(int bid) { d_alpha[bid * rd + n_diff] += d_mu[bid]; }); + } + + if (conf_int) { + thrust::for_each(thrust::cuda::par.on(stream), counting, counting + nb, + [=] __device__(int bid) { + const double* b_P = d_P + bid * rd2; + + double Ft; + if (n_diff == 0) + Ft = b_P[0]; + else { + const double* b_Z = d_Z + bid * rd; + Ft = 0.0; + for (int i = 0; i < rd; i++) { + for (int j = 0; j < rd; j++) { + Ft += b_P[j * rd + i] * b_Z[i] * b_Z[j]; + } + } + } + + d_F_fc[bid * fc_steps + it] = Ft; + }); + + // P = T*P*T' + R*Q*R' + // TP = T*P + if (rd <= 32) + MLCommon::Sparse::Batched::b_spmm(1.0, T_sparse, P, 0.0, TP); + else + MLCommon::LinAlg::Batched::b_gemm(false, false, rd, rd, rd, 1.0, T, P, + 0.0, TP); + // P = TP*T' + MLCommon::LinAlg::Batched::b_gemm(false, true, rd, rd, rd, 1.0, TP, T, + 0.0, P); + // P = P + R*Q*R' + raft::linalg::binaryOp( + d_P, d_P, RQR.raw_data(), rd2 * nb, + [=] __device__(double a, double b) { return a + b; }, stream); } } } /// Wrapper around functions that execute the Kalman loop (for performance) -void batched_kalman_loop(cumlHandle& handle, const double* ys, int nobs, +void 
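[Editor's note] Steps 1–6 of the loop above form one textbook Kalman filter update; the `n_diff == 0` branches are just the special case Z = [1 0 … 0]. A single-batch CPU reference of the same math, assuming column-major storage and writing the batched GEMMs out as naive loops (`kalman_update` is a hypothetical name, not cuML API):

```cpp
#include <vector>

// Hypothetical single-batch CPU reference for one observation of the loop
// above (steps 1-6); column-major rd x rd matrices. Returns the innovation
// v and writes F through F_out. Pass mu = 0 when no intercept is fitted.
double kalman_update(double y, std::vector<double>& alpha,
                     std::vector<double>& P, const std::vector<double>& Z,
                     const std::vector<double>& T,
                     const std::vector<double>& RQR, double mu, int rd,
                     int n_diff, double* F_out) {
  auto at = [rd](const std::vector<double>& M, int i, int j) {
    return M[j * rd + i];
  };
  // 1. v = y - Z*alpha
  double v = y;
  for (int i = 0; i < rd; i++) v -= Z[i] * alpha[i];
  // 2. F = Z*P*Z'
  double F = 0.0;
  for (int j = 0; j < rd; j++)
    for (int i = 0; i < rd; i++) F += at(P, i, j) * Z[i] * Z[j];
  *F_out = F;
  // TP = T*P (reused by steps 3 and 6, as in the device code)
  std::vector<double> TP(rd * rd, 0.0);
  for (int j = 0; j < rd; j++)
    for (int k = 0; k < rd; k++)
      for (int i = 0; i < rd; i++) TP[j * rd + i] += at(T, i, k) * at(P, k, j);
  // 3. K = (1/F) * TP*Z'
  std::vector<double> K(rd, 0.0);
  for (int i = 0; i < rd; i++) {
    for (int j = 0; j < rd; j++) K[i] += TP[j * rd + i] * Z[j];
    K[i] /= F;
  }
  // 4. alpha = T*alpha + K*v + c, with mu entering at state index n_diff
  std::vector<double> Ta(rd, 0.0);
  for (int j = 0; j < rd; j++)
    for (int i = 0; i < rd; i++) Ta[i] += at(T, i, j) * alpha[j];
  for (int i = 0; i < rd; i++)
    alpha[i] = Ta[i] + K[i] * v + (i == n_diff ? mu : 0.0);
  // 5. L = T - K*Z
  std::vector<double> L(T);
  for (int j = 0; j < rd; j++)
    for (int i = 0; i < rd; i++) L[j * rd + i] -= K[i] * Z[j];
  // 6. P = TP*L' + R*Q*R'
  std::vector<double> Pn(rd * rd, 0.0);
  for (int j = 0; j < rd; j++)
    for (int k = 0; k < rd; k++)
      for (int i = 0; i < rd; i++)
        Pn[j * rd + i] += TP[k * rd + i] * at(L, j, k);  // TP * L'
  for (int i = 0; i < rd * rd; i++) P[i] = Pn[i] + RQR[i];
  return v;
}
```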
batched_kalman_loop(raft::handle_t& handle, const double* ys, int nobs, const MLCommon::LinAlg::Batched::Matrix& T, const MLCommon::LinAlg::Batched::Matrix& Z, - const MLCommon::LinAlg::Batched::Matrix& RRT, + const MLCommon::LinAlg::Batched::Matrix& RQR, MLCommon::LinAlg::Batched::Matrix& P0, MLCommon::LinAlg::Batched::Matrix& alpha, std::vector& T_mask, bool intercept, - const double* d_mu, int r, double* vs, double* Fs, - double* sum_logFs, int fc_steps = 0, - double* d_fc = nullptr) { + const double* d_mu, const ARIMAOrder& order, + double* vs, double* Fs, double* sum_logFs, + int fc_steps = 0, double* d_fc = nullptr, + bool conf_int = false, double* d_F_fc = nullptr) { const int batch_size = T.batches(); auto stream = T.stream(); + int rd = order.rd(); + int n_diff = order.n_diff(); dim3 numThreadsPerBlock(32, 1); - dim3 numBlocks(MLCommon::ceildiv(batch_size, numThreadsPerBlock.x), 1); - if (r <= 8) { - switch (r) { + dim3 numBlocks(raft::ceildiv(batch_size, numThreadsPerBlock.x), 1); + if (rd <= 8) { + switch (rd) { case 1: batched_kalman_loop_kernel<1> <<>>( - ys, nobs, T.raw_data(), Z.raw_data(), RRT.raw_data(), P0.raw_data(), + ys, nobs, T.raw_data(), Z.raw_data(), RQR.raw_data(), P0.raw_data(), alpha.raw_data(), intercept, d_mu, batch_size, vs, Fs, sum_logFs, - fc_steps, d_fc); + n_diff, fc_steps, d_fc, conf_int, d_F_fc); break; case 2: batched_kalman_loop_kernel<2> <<>>( - ys, nobs, T.raw_data(), Z.raw_data(), RRT.raw_data(), P0.raw_data(), + ys, nobs, T.raw_data(), Z.raw_data(), RQR.raw_data(), P0.raw_data(), alpha.raw_data(), intercept, d_mu, batch_size, vs, Fs, sum_logFs, - fc_steps, d_fc); + n_diff, fc_steps, d_fc, conf_int, d_F_fc); break; case 3: batched_kalman_loop_kernel<3> <<>>( - ys, nobs, T.raw_data(), Z.raw_data(), RRT.raw_data(), P0.raw_data(), + ys, nobs, T.raw_data(), Z.raw_data(), RQR.raw_data(), P0.raw_data(), alpha.raw_data(), intercept, d_mu, batch_size, vs, Fs, sum_logFs, - fc_steps, d_fc); + n_diff, fc_steps, d_fc, conf_int, d_F_fc); break; case 4: batched_kalman_loop_kernel<4> <<>>( - ys, nobs, T.raw_data(), Z.raw_data(), RRT.raw_data(), P0.raw_data(), + ys, nobs, T.raw_data(), Z.raw_data(), RQR.raw_data(), P0.raw_data(), alpha.raw_data(), intercept, d_mu, batch_size, vs, Fs, sum_logFs, - fc_steps, d_fc); + n_diff, fc_steps, d_fc, conf_int, d_F_fc); break; case 5: batched_kalman_loop_kernel<5> <<>>( - ys, nobs, T.raw_data(), Z.raw_data(), RRT.raw_data(), P0.raw_data(), + ys, nobs, T.raw_data(), Z.raw_data(), RQR.raw_data(), P0.raw_data(), alpha.raw_data(), intercept, d_mu, batch_size, vs, Fs, sum_logFs, - fc_steps, d_fc); + n_diff, fc_steps, d_fc, conf_int, d_F_fc); break; case 6: batched_kalman_loop_kernel<6> <<>>( - ys, nobs, T.raw_data(), Z.raw_data(), RRT.raw_data(), P0.raw_data(), + ys, nobs, T.raw_data(), Z.raw_data(), RQR.raw_data(), P0.raw_data(), alpha.raw_data(), intercept, d_mu, batch_size, vs, Fs, sum_logFs, - fc_steps, d_fc); + n_diff, fc_steps, d_fc, conf_int, d_F_fc); break; case 7: batched_kalman_loop_kernel<7> <<>>( - ys, nobs, T.raw_data(), Z.raw_data(), RRT.raw_data(), P0.raw_data(), + ys, nobs, T.raw_data(), Z.raw_data(), RQR.raw_data(), P0.raw_data(), alpha.raw_data(), intercept, d_mu, batch_size, vs, Fs, sum_logFs, - fc_steps, d_fc); + n_diff, fc_steps, d_fc, conf_int, d_F_fc); break; case 8: batched_kalman_loop_kernel<8> <<>>( - ys, nobs, T.raw_data(), Z.raw_data(), RRT.raw_data(), P0.raw_data(), + ys, nobs, T.raw_data(), Z.raw_data(), RQR.raw_data(), P0.raw_data(), alpha.raw_data(), intercept, d_mu, batch_size, vs, Fs, sum_logFs, - 
fc_steps, d_fc); + n_diff, fc_steps, d_fc, conf_int, d_F_fc); break; } CUDA_CHECK(cudaPeekAtLastError()); @@ -409,31 +600,29 @@ void batched_kalman_loop(cumlHandle& handle, const double* ys, int nobs, // Note: not always used MLCommon::Sparse::Batched::CSR T_sparse = MLCommon::Sparse::Batched::CSR::from_dense( - T, T_mask, handle.getImpl().getcusolverSpHandle()); - _batched_kalman_loop_large(ys, nobs, T, T_sparse, Z, RRT, P0, alpha, - intercept, d_mu, r, vs, Fs, sum_logFs, fc_steps, - d_fc); + T, T_mask, handle.get_cusolver_sp_handle()); + _batched_kalman_loop_large(ys, nobs, T, T_sparse, Z, RQR, P0, alpha, + intercept, d_mu, rd, vs, Fs, sum_logFs, n_diff, + fc_steps, d_fc, conf_int, d_F_fc); } } template -__global__ void batched_kalman_loglike_kernel(const double* d_vs, - const double* d_Fs, - const double* d_sumLogFs, - int nobs, int batch_size, - double* loglike) { +__global__ void batched_kalman_loglike_kernel( + const double* d_vs, const double* d_Fs, const double* d_sumLogFs, int nobs, + int batch_size, double* d_loglike, double* d_sigma2, int n_diff, + double level) { using BlockReduce = cub::BlockReduce; __shared__ typename BlockReduce::TempStorage temp_storage; int tid = threadIdx.x; int bid = blockIdx.x; - int num_threads = blockDim.x; double bid_sigma2 = 0.0; - for (int it = 0; it < nobs; it += num_threads) { + for (int it = 0; it < nobs; it += NUM_THREADS) { // vs and Fs are in time-major order (memory layout: column major) int idx = (it + tid) + bid * nobs; double d_vs2_Fs = 0.0; - if (it + tid < nobs) { + if (it + tid >= n_diff && it + tid < nobs) { double _vi = d_vs[idx]; d_vs2_Fs = _vi * _vi / d_Fs[idx]; } @@ -442,79 +631,134 @@ __global__ void batched_kalman_loglike_kernel(const double* d_vs, bid_sigma2 += partial_sum; } if (tid == 0) { - double nobs_f = static_cast(nobs); - bid_sigma2 /= nobs_f; - loglike[bid] = - -.5 * (d_sumLogFs[bid] + nobs_f * bid_sigma2 + nobs_f * (log(2 * M_PI))); + double nobs_diff_f = static_cast(nobs - n_diff); + bid_sigma2 /= nobs_diff_f; + if (level != 0) d_sigma2[bid] = bid_sigma2; + d_loglike[bid] = -.5 * (d_sumLogFs[bid] + nobs_diff_f * bid_sigma2 + + nobs_diff_f * (log(2 * M_PI))); } } -void batched_kalman_loglike(const double* d_vs, const double* d_Fs, - const double* d_sumLogFs, int nobs, int batch_size, - double* loglike, cudaStream_t stream) { - constexpr int NUM_THREADS = 128; - batched_kalman_loglike_kernel - <<>>(d_vs, d_Fs, d_sumLogFs, nobs, - batch_size, loglike); - CUDA_CHECK(cudaGetLastError()); +/** + * Kernel to finalize the computation of confidence intervals + * + * @note: One block per batch member, one thread per forecast time step + * + * @param[in] d_fc Mean forecasts + * @param[in] d_sigma2 sum(v_t * v_t / F_t) / n_obs_diff + * @param[inout] d_lower Input: F_{n+t} + * Output: lower bound of the confidence intervals + * @param[out] d_upper Upper bound of the confidence intervals + * @param[in] fc_steps Number of forecast steps + * @param[in] multiplier Coefficient associated with the confidence level + */ +__global__ void confidence_intervals(const double* d_fc, const double* d_sigma2, + double* d_lower, double* d_upper, + int fc_steps, double multiplier) { + int idx = blockIdx.x * fc_steps + threadIdx.x; + double fc = d_fc[idx]; + double margin = multiplier * sqrt(d_lower[idx] * d_sigma2[blockIdx.x]); + d_lower[idx] = fc - margin; + d_upper[idx] = fc + margin; } /// Internal Kalman filter implementation that assumes data exists on GPU. 
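[Editor's note] Once the loop has produced v_t and F_t, the two kernels above finalize everything per batch member: the first n_diff observations are skipped as diffuse, sigma2 = mean(v_t²/F_t) serves as both the log-likelihood scale and the width of the prediction intervals, and the multiplier sqrt(2)·erfinv(level) is the usual Gaussian quantile. A scalar sketch of the same math; standard C++ has no erfinv, so a Newton iteration on std::erf stands in for it (an assumption, not the device implementation):

```cpp
#include <cmath>
#include <vector>

// Scalar sketch of batched_kalman_loglike_kernel / confidence_intervals for
// one batch member; the first n_diff observations are skipped as diffuse.
double loglike_and_sigma2(const std::vector<double>& v,
                          const std::vector<double>& F, int n_diff,
                          double* sigma2_out) {
  constexpr double kTwoPi = 6.283185307179586;
  const int nobs = static_cast<int>(v.size());
  double sum_logF = 0.0, sigma2 = 0.0;
  for (int t = n_diff; t < nobs; t++) {
    sum_logF += std::log(F[t]);
    sigma2 += v[t] * v[t] / F[t];
  }
  const double n = static_cast<double>(nobs - n_diff);
  sigma2 /= n;
  *sigma2_out = sigma2;
  return -0.5 * (sum_logF + n * sigma2 + n * std::log(kTwoPi));
}

// Newton iteration on std::erf standing in for erfinv (assumption;
// converges monotonically from below for the usual levels in (0, 1)).
double erfinv_approx(double y) {
  constexpr double kTwoOverSqrtPi = 1.1283791670955126;  // erf'(0)
  double x = 0.0;
  for (int i = 0; i < 50; i++)
    x -= (std::erf(x) - y) / (kTwoOverSqrtPi * std::exp(-x * x));
  return x;
}

// Interval bounds, matching the kernel: fc +/- z * sqrt(sigma2 * F_fc)
// with z = sqrt(2) * erfinv(level).
void interval(double fc, double F_fc, double sigma2, double level,
              double* lower, double* upper) {
  const double margin =
    std::sqrt(2.0) * erfinv_approx(level) * std::sqrt(sigma2 * F_fc);
  *lower = fc - margin;
  *upper = fc + margin;
}
```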
-void _batched_kalman_filter(cumlHandle& handle, const double* d_ys, int nobs, +void _batched_kalman_filter(raft::handle_t& handle, const double* d_ys, + int nobs, const ARIMAOrder& order, const MLCommon::LinAlg::Batched::Matrix& Zb, const MLCommon::LinAlg::Batched::Matrix& Tb, const MLCommon::LinAlg::Batched::Matrix& Rb, - std::vector& T_mask, int r, double* d_vs, + std::vector& T_mask, double* d_vs, double* d_Fs, double* d_loglike, const double* d_sigma2, bool intercept, - const double* d_mu, int fc_steps = 0, - double* d_fc = nullptr) { + const double* d_mu, int fc_steps, double* d_fc, + double level, double* d_lower, double* d_upper) { const size_t batch_size = Zb.batches(); - auto stream = handle.getStream(); - auto cublasHandle = handle.getImpl().getCublasHandle(); - auto allocator = handle.getDeviceAllocator(); + auto stream = handle.get_stream(); + auto cublasHandle = handle.get_cublas_handle(); + auto allocator = handle.get_device_allocator(); auto counting = thrust::make_counting_iterator(0); - MLCommon::LinAlg::Batched::Matrix RQb(r, 1, batch_size, cublasHandle, + int n_diff = order.n_diff(); + int rd = order.rd(); + int r = order.r(); + + MLCommon::LinAlg::Batched::Matrix RQb(rd, 1, batch_size, cublasHandle, allocator, stream, true); double* d_RQ = RQb.raw_data(); const double* d_R = Rb.raw_data(); thrust::for_each(thrust::cuda::par.on(stream), counting, counting + batch_size, [=] __device__(int bid) { double sigma2 = d_sigma2[bid]; - for (int i = 0; i < r; i++) { - d_RQ[bid * r + i] = d_R[bid * r + i] * sigma2; + for (int i = 0; i < rd; i++) { + d_RQ[bid * rd + i] = d_R[bid * rd + i] * sigma2; } }); - MLCommon::LinAlg::Batched::Matrix RRT = + MLCommon::LinAlg::Batched::Matrix RQR = MLCommon::LinAlg::Batched::b_gemm(RQb, Rb, false, true); // Durbin Koopman "Time Series Analysis" pg 138 ML::PUSH_RANGE("Init P"); - MLCommon::LinAlg::Batched::Matrix P = - MLCommon::LinAlg::Batched::b_lyapunov(Tb, RRT); + MLCommon::LinAlg::Batched::Matrix P(rd, rd, batch_size, cublasHandle, + allocator, stream, true); + { + double* d_P = P.raw_data(); + + if (n_diff > 0) { + // Initialize the diffuse part with a large variance + /// TODO: pass this as a parameter + constexpr double kappa = 1e6; + thrust::for_each(thrust::cuda::par.on(stream), counting, + counting + batch_size, [=] __device__(int bid) { + double* b_P = d_P + rd * rd * bid; + for (int i = 0; i < n_diff; i++) { + b_P[(rd + 1) * i] = kappa; + } + }); + + // Initialize the stationary part by solving a Lyapunov equation + /// TODO: reduce amount of memory copies + MLCommon::LinAlg::Batched::Matrix Ts = + MLCommon::LinAlg::Batched::b_2dcopy(Tb, n_diff, n_diff, r, r); + MLCommon::LinAlg::Batched::Matrix RQRs = + MLCommon::LinAlg::Batched::b_2dcopy(RQR, n_diff, n_diff, r, r); + MLCommon::LinAlg::Batched::Matrix Ps = + MLCommon::LinAlg::Batched::b_lyapunov(Ts, RQRs); + MLCommon::LinAlg::Batched::b_2dcopy(Ps, P, 0, 0, r, r, n_diff, n_diff); + } else { + // Initialize by solving a Lyapunov equation + /// TODO: avoid copy + P = MLCommon::LinAlg::Batched::b_lyapunov(Tb, RQR); + } + } ML::POP_RANGE(); - // Initialize the state alpha as the solution of (I - T) x = c - // Note: optimized as c = (mu 0 ... 0)' + // Initialize the state alpha by solving (I - T*) x* = c with: + // | mu | + // c = | 0 | + // | . 
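[Editor's note] The P initialization below splits the state by block: the d + s·D differencing states are diffuse and receive a large variance kappa on the diagonal, while the stationary r×r ARMA block solves the Lyapunov equation Ps = Ts·Ps·Ts' + RQRs. A toy single-batch sketch, with a naive fixed-point iteration standing in for the batched b_lyapunov solver (an assumption; it converges when Ts is stable, which holds for stationary parameters):

```cpp
#include <vector>

// Sketch of the P0 layout below: diffuse kappa on the first n_diff diagonal
// entries, stationary r x r block from a Lyapunov solve (column-major).
std::vector<double> init_P(const std::vector<double>& T,
                           const std::vector<double>& RQR, int rd, int n_diff,
                           double kappa = 1e6) {
  const int r = rd - n_diff;
  std::vector<double> P(rd * rd, 0.0);
  for (int i = 0; i < n_diff; i++) P[(rd + 1) * i] = kappa;

  // Stationary block: solve Ps = Ts*Ps*Ts' + RQRs by fixed-point iteration
  // (a naive stand-in for b_lyapunov, only for illustration).
  auto Ts = [&](int i, int j) { return T[(j + n_diff) * rd + i + n_diff]; };
  auto RQRs = [&](int i, int j) { return RQR[(j + n_diff) * rd + i + n_diff]; };
  std::vector<double> Ps(r * r, 0.0), tmp(r * r);
  for (int iter = 0; iter < 1000; iter++) {
    for (int j = 0; j < r; j++)
      for (int i = 0; i < r; i++) {
        double acc = RQRs(i, j);
        for (int k = 0; k < r; k++)
          for (int l = 0; l < r; l++)
            acc += Ts(i, k) * Ps[l * r + k] * Ts(j, l);  // (Ts*Ps*Ts')_{ij}
        tmp[j * r + i] = acc;
      }
    Ps.swap(tmp);
  }
  for (int j = 0; j < r; j++)
    for (int i = 0; i < r; i++)
      P[(j + n_diff) * rd + i + n_diff] = Ps[j * r + i];
  return P;
}
```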
| + // | 0 | + // T* = T[d+s*D:, d+s*D:] + // x* = alpha_0[d+s*D:] MLCommon::LinAlg::Batched::Matrix alpha( - r, 1, batch_size, handle.getImpl().getCublasHandle(), - handle.getDeviceAllocator(), stream, true); + rd, 1, batch_size, handle.get_cublas_handle(), + handle.get_device_allocator(), stream, false); if (intercept) { - // Compute I-T + // Compute I-T* MLCommon::LinAlg::Batched::Matrix ImT( r, r, batch_size, cublasHandle, allocator, stream, false); const double* d_T = Tb.raw_data(); double* d_ImT = ImT.raw_data(); thrust::for_each(thrust::cuda::par.on(stream), counting, counting + batch_size, [=] __device__(int bid) { - const double* b_T = d_T + r * r * bid; + const double* b_T = d_T + rd * rd * bid; double* b_ImT = d_ImT + r * r * bid; for (int i = 0; i < r; i++) { for (int j = 0; j < r; j++) { b_ImT[r * j + i] = - (i == j ? 1.0 : 0.0) - b_T[r * j + i]; + (i == j ? 1.0 : 0.0) - + b_T[rd * (j + n_diff) + i + n_diff]; } } }); @@ -524,169 +768,250 @@ void _batched_kalman_filter(cumlHandle& handle, const double* d_ys, int nobs, thrust::for_each(thrust::cuda::par.on(stream), counting, counting + batch_size, [=] __device__(int bid) { if (abs(d_ImT[bid]) < 1e-3) - d_ImT[bid] = MLCommon::signPrim(d_ImT[bid]) * 1e-3; + d_ImT[bid] = raft::signPrim(d_ImT[bid]) * 1e-3; }); } - // Compute (I-T)^-1 + // Compute (I-T*)^-1 MLCommon::LinAlg::Batched::Matrix ImT_inv = ImT.inv(); - // Compute (I-T)^-1 * c -> multiply 1st column by mu + // Compute (I-T*)^-1 * c -> multiply 1st column by mu const double* d_ImT_inv = ImT_inv.raw_data(); double* d_alpha = alpha.raw_data(); thrust::for_each(thrust::cuda::par.on(stream), counting, counting + batch_size, [=] __device__(int bid) { const double* b_ImT_inv = d_ImT_inv + r * r * bid; - double* b_alpha = d_alpha + r * bid; + double* b_alpha = d_alpha + rd * bid; double mu = d_mu[bid]; + for (int i = 0; i < n_diff; i++) { + b_alpha[i] = 0; + } for (int i = 0; i < r; i++) { - b_alpha[i] = b_ImT_inv[i] * mu; + b_alpha[i + n_diff] = b_ImT_inv[i] * mu; } }); + } else { + // Memset alpha to 0 + CUDA_CHECK(cudaMemsetAsync(alpha.raw_data(), 0, + sizeof(double) * rd * batch_size, stream)); } - // init vs, Fs - // In batch-major format. 
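[Editor's note] With an intercept, the stationary part of alpha_0 is the steady state of alpha = T*·alpha + c, i.e. the solution of (I − T*)·x = c with c = (mu, 0, …, 0)'; since c has a single nonzero entry, the device code only needs the first column of (I − T*)⁻¹. A scalar sketch solving the same system directly, with unpivoted Gaussian elimination standing in for the batched inverse (an assumption, for brevity):

```cpp
#include <vector>

// Sketch of the intercept initialization above: solve (I - T*) x = c with
// c = (mu, 0, ..., 0)'; T* is the stationary r x r block of T (column-major).
std::vector<double> steady_state_alpha(const std::vector<double>& T, int rd,
                                       int n_diff, double mu) {
  const int r = rd - n_diff;
  // Build A = I - T* and b = mu * e1
  std::vector<double> A(r * r), b(r, 0.0);
  for (int j = 0; j < r; j++)
    for (int i = 0; i < r; i++)
      A[j * r + i] = (i == j ? 1.0 : 0.0) - T[(j + n_diff) * rd + i + n_diff];
  b[0] = mu;
  // Unpivoted Gaussian elimination (illustration only)
  for (int k = 0; k < r; k++) {
    for (int i = k + 1; i < r; i++) {
      double f = A[k * r + i] / A[k * r + k];
      for (int j = k; j < r; j++) A[j * r + i] -= f * A[j * r + k];
      b[i] -= f * b[k];
    }
  }
  for (int i = r - 1; i >= 0; i--) {
    for (int j = i + 1; j < r; j++) b[i] -= A[j * r + i] * b[j];
    b[i] /= A[i * r + i];
  }
  // Full state: zeros for the n_diff differencing states, then x
  std::vector<double> alpha(rd, 0.0);
  for (int i = 0; i < r; i++) alpha[i + n_diff] = b[i];
  return alpha;
}
```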
- double* d_sumlogFs; - - d_sumlogFs = (double*)handle.getDeviceAllocator()->allocate( - sizeof(double) * batch_size, stream); - - batched_kalman_loop(handle, d_ys, nobs, Tb, Zb, RRT, P, alpha, T_mask, - intercept, d_mu, r, d_vs, d_Fs, d_sumlogFs, fc_steps, - d_fc); + MLCommon::device_buffer sumLogF_buffer(allocator, stream, batch_size); - // Finalize loglikelihood - batched_kalman_loglike(d_vs, d_Fs, d_sumlogFs, nobs, batch_size, d_loglike, - stream); + batched_kalman_loop(handle, d_ys, nobs, Tb, Zb, RQR, P, alpha, T_mask, + intercept, d_mu, order, d_vs, d_Fs, sumLogF_buffer.data(), + fc_steps, d_fc, level > 0, d_lower); - handle.getDeviceAllocator()->deallocate(d_sumlogFs, - sizeof(double) * batch_size, stream); + // Finalize loglikelihood and prediction intervals + MLCommon::device_buffer sigma2_buffer(allocator, stream, batch_size); + constexpr int NUM_THREADS = 128; + batched_kalman_loglike_kernel + <<>>( + d_vs, d_Fs, sumLogF_buffer.data(), nobs, batch_size, d_loglike, + sigma2_buffer.data(), n_diff, level); + CUDA_CHECK(cudaPeekAtLastError()); + if (level > 0) { + confidence_intervals<<>>( + d_fc, sigma2_buffer.data(), d_lower, d_upper, fc_steps, + sqrt(2.0) * erfinv(level)); + CUDA_CHECK(cudaPeekAtLastError()); + } } -void init_batched_kalman_matrices(cumlHandle& handle, const double* d_ar, +void init_batched_kalman_matrices(raft::handle_t& handle, const double* d_ar, const double* d_ma, const double* d_sar, const double* d_sma, int nb, - const ARIMAOrder& order, int r, double* d_Z_b, - double* d_R_b, double* d_T_b, + const ARIMAOrder& order, int rd, + double* d_Z_b, double* d_R_b, double* d_T_b, std::vector& T_mask) { ML::PUSH_RANGE(__func__); - auto stream = handle.getStream(); + auto stream = handle.get_stream(); // Note: Z is unused yet but kept to avoid reintroducing it later when // adding support for exogeneous variables - cudaMemsetAsync(d_Z_b, 0.0, r * nb * sizeof(double), stream); - cudaMemsetAsync(d_R_b, 0.0, r * nb * sizeof(double), stream); - cudaMemsetAsync(d_T_b, 0.0, r * r * nb * sizeof(double), stream); + cudaMemsetAsync(d_Z_b, 0.0, rd * nb * sizeof(double), stream); + cudaMemsetAsync(d_R_b, 0.0, rd * nb * sizeof(double), stream); + cudaMemsetAsync(d_T_b, 0.0, rd * rd * nb * sizeof(double), stream); - auto counting = thrust::make_counting_iterator(0); - thrust::for_each(thrust::cuda::par.on(stream), counting, counting + nb, - [=] __device__(int bid) { - // See TSA pg. 54 for Z,R,T matrices - // Z = [1 0 0 0 ... 0] - d_Z_b[bid * r] = 1.0; - - // |1.0 | - // R = |theta_1 | - // | ... | - // |theta_{r-1}| - // - d_R_b[bid * r] = 1.0; - for (int i = 0; i < r - 1; i++) { - d_R_b[bid * r + i + 1] = - MLCommon::TimeSeries::reduced_polynomial( - bid, d_ma, order.q, d_sma, order.Q, order.s, i + 1); - } + int n_diff = order.n_diff(); + int r = order.r(); - // |phi_1 1.0 0.0 ... 0.0| - // | . 1.0 | - // | . . | - // T = | . . 0.0| - // | . . | - // | . 1.0| - // |phi_r 0.0 0.0 ... 0.0| - // - double* batch_T = d_T_b + bid * r * r; - for (int i = 0; i < r; i++) { - batch_T[i] = - MLCommon::TimeSeries::reduced_polynomial( - bid, d_ar, order.p, d_sar, order.P, order.s, i + 1); - } - // shifted identity - for (int i = 0; i < r - 1; i++) { - batch_T[(i + 1) * r + i] = 1.0; - } + auto counting = thrust::make_counting_iterator(0); + auto n_theta = order.n_theta(); + auto n_phi = order.n_phi(); + thrust::for_each( + thrust::cuda::par.on(stream), counting, counting + nb, + [=] __device__(int bid) { + // See TSA pg. 54 for Z, R, T matrices + + // Z = [ 1 | 0 . . 0 1 0 . . 0 1 | 1 0 . . 
0 ] + // d | s*D | r + for (int i = 0; i < order.d; i++) d_Z_b[bid * rd + i] = 1.0; + for (int i = 1; i <= order.D; i++) + d_Z_b[bid * rd + order.d + i * order.s - 1] = 1.0; + d_Z_b[bid * rd + n_diff] = 1.0; + + // | 0 | + // | . | d + s*D + // | 0 |_ _ + // R = | 1 | + // | theta_1 | r + // | . | + // |theta_{r-1}| + // + d_R_b[bid * rd + n_diff] = 1.0; + for (int i = 0; i < n_theta; i++) { + d_R_b[bid * rd + n_diff + i + 1] = + MLCommon::TimeSeries::reduced_polynomial( + bid, d_ma, order.q, d_sma, order.Q, order.s, i + 1); + } - // If r=2 and phi_2=-1, I-TxT is singular - if (r == 2 && order.p == 2 && abs(batch_T[1] + 1) < 0.01) { - batch_T[1] = -0.99; - } - }); + // | 1 | 0 .. 0 1 | 1 | d + // |_ _|_ _ _ _ _ |_ _ _ _ _ _ _ _ _ |_ _ + // | | 0 .. 0 1 | 1 | + // | | 1 0 | | + // | | . . | | s*D + // | | . . | | + // T = | | 0 1 0 | | + // |_ _|_ _ _ _ _ |_ _ _ _ _ _ _ _ _ |_ _ + // | | | phi_1 1 | + // | | | . 1 0 | + // | | | . . | r + // | | | . 0 . | + // | | | . 1 | + // | | | phi_r 0 . . 0 | + // + // (non-comprehensive example with d=1 and D=1) + // + double* batch_T = d_T_b + bid * rd * rd; + // 1. Differencing component + for (int i = 0; i < order.d; i++) { + for (int j = i; j < order.d; j++) { + batch_T[j * rd + i] = 1.0; + } + } + for (int id = 0; id < order.d; id++) { + batch_T[n_diff * rd + id] = 1.0; + for (int iD = 1; iD <= order.D; iD++) { + batch_T[(order.d + order.s * iD - 1) * rd + id] = 1.0; + } + } + // 2. Seasonal differencing component + for (int iD = 0; iD < order.D; iD++) { + int offset = order.d + iD * order.s; + for (int i = 0; i < order.s - 1; i++) { + batch_T[(offset + i) * rd + offset + i + 1] = 1.0; + } + batch_T[(offset + order.s - 1) * rd + offset] = 1.0; + batch_T[n_diff * rd + offset] = 1.0; + } + if (order.D == 2) { + batch_T[(n_diff - 1) * rd + order.d] = 1.0; + } + // 3. Auto-Regressive component + for (int i = 0; i < n_phi; i++) { + batch_T[n_diff * (rd + 1) + i] = + MLCommon::TimeSeries::reduced_polynomial( + bid, d_ar, order.p, d_sar, order.P, order.s, i + 1); + } + for (int i = 0; i < r - 1; i++) { + batch_T[(n_diff + i + 1) * rd + n_diff + i] = 1.0; + } - T_mask.resize(r * r, false); + // If rd=2 and phi_2=-1, I-TxT is singular + if (rd == 2 && order.p == 2 && abs(batch_T[1] + 1) < 0.01) { + batch_T[1] = -0.99; + } + }); + + // T density/sparsity mask + T_mask.resize(rd * rd, false); + // 1. Differencing component + for (int i = 0; i < order.d; i++) { + for (int j = i; j < order.d; j++) { + T_mask[j * rd + i] = true; + } + } + for (int id = 0; id < order.d; id++) { + T_mask[n_diff * rd + id] = true; + for (int iD = 1; iD <= order.D; iD++) { + T_mask[(order.d + order.s * iD - 1) * rd + id] = true; + } + } + // 2. Seasonal differencing component + for (int iD = 0; iD < order.D; iD++) { + int offset = order.d + iD * order.s; + for (int i = 0; i < order.s - 1; i++) { + T_mask[(offset + i) * rd + offset + i + 1] = true; + } + T_mask[(offset + order.s - 1) * rd + offset] = true; + T_mask[n_diff * rd + offset] = true; + } + if (order.D == 2) { + T_mask[(n_diff - 1) * rd + order.d] = true; + } + // 3. 
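[Editor's note] The block layout used for Z, R, and T above follows directly from the order: rd = d + s·D + r states, of which the first n_diff = d + s·D carry the differencing. A tiny host-side sketch of the Z layout (`build_Z` is a hypothetical helper mirroring the device loops above):

```cpp
#include <vector>

// Z has d ones for the differencing states, a one at each seasonal lag, and
// a leading one for the r-dimensional ARMA block: rd = d + s*D + r.
std::vector<double> build_Z(int d, int D, int s, int r) {
  const int n_diff = d + s * D;
  const int rd = n_diff + r;
  std::vector<double> Z(rd, 0.0);
  for (int i = 0; i < d; i++) Z[i] = 1.0;               // differencing
  for (int i = 1; i <= D; i++) Z[d + i * s - 1] = 1.0;  // seasonal diff
  Z[n_diff] = 1.0;                                      // ARMA block
  return Z;
}
// e.g. d=1, D=1, s=4, r=2  ->  Z = [1, 0, 0, 0, 1, 1, 0], rd = 7
```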
Auto-Regressive component for (int iP = 0; iP < order.P + 1; iP++) { for (int ip = 0; ip < order.p + 1; ip++) { int i = iP * order.s + ip - 1; - if (i >= 0) T_mask[i] = true; + if (i >= 0) T_mask[n_diff * (rd + 1) + i] = true; } } for (int i = 0; i < r - 1; i++) { - T_mask[(i + 1) * r + i] = true; + T_mask[(n_diff + i + 1) * rd + n_diff + i] = true; } ML::POP_RANGE(); } -void batched_kalman_filter(cumlHandle& handle, const double* d_ys, int nobs, +void batched_kalman_filter(raft::handle_t& handle, const double* d_ys, int nobs, const ARIMAParams& params, const ARIMAOrder& order, int batch_size, double* d_loglike, double* d_vs, int fc_steps, - double* d_fc) { + double* d_fc, double level, double* d_lower, + double* d_upper) { ML::PUSH_RANGE(__func__); - const size_t ys_len = nobs; - - auto cublasHandle = handle.getImpl().getCublasHandle(); - auto stream = handle.getStream(); - auto allocator = handle.getDeviceAllocator(); + auto cublasHandle = handle.get_cublas_handle(); + auto stream = handle.get_stream(); + auto allocator = handle.get_device_allocator(); // see (3.18) in TSA by D&K - int r = order.r(); + int rd = order.rd(); - MLCommon::LinAlg::Batched::Matrix Zb(1, r, batch_size, cublasHandle, + MLCommon::LinAlg::Batched::Matrix Zb(1, rd, batch_size, cublasHandle, allocator, stream, false); - MLCommon::LinAlg::Batched::Matrix Tb(r, r, batch_size, cublasHandle, + MLCommon::LinAlg::Batched::Matrix Tb(rd, rd, batch_size, cublasHandle, allocator, stream, false); - MLCommon::LinAlg::Batched::Matrix Rb(r, 1, batch_size, cublasHandle, + MLCommon::LinAlg::Batched::Matrix Rb(rd, 1, batch_size, cublasHandle, allocator, stream, false); std::vector T_mask; init_batched_kalman_matrices(handle, params.ar, params.ma, params.sar, - params.sma, batch_size, order, r, Zb.raw_data(), + params.sma, batch_size, order, rd, Zb.raw_data(), Rb.raw_data(), Tb.raw_data(), T_mask); //////////////////////////////////////////////////////////// // Computation - double* d_Fs = - (double*)allocator->allocate(ys_len * batch_size * sizeof(double), stream); - - _batched_kalman_filter(handle, d_ys, nobs, Zb, Tb, Rb, T_mask, r, d_vs, d_Fs, - d_loglike, params.sigma2, static_cast(order.k), - params.mu, fc_steps, d_fc); + MLCommon::device_buffer F_buffer(allocator, stream, + nobs * batch_size); - allocator->deallocate(d_Fs, ys_len * batch_size * sizeof(double), stream); + _batched_kalman_filter(handle, d_ys, nobs, order, Zb, Tb, Rb, T_mask, d_vs, + F_buffer.data(), d_loglike, params.sigma2, + static_cast(order.k), params.mu, fc_steps, d_fc, + level, d_lower, d_upper); ML::POP_RANGE(); } -void batched_jones_transform(cumlHandle& handle, const ARIMAOrder& order, +void batched_jones_transform(raft::handle_t& handle, const ARIMAOrder& order, int batch_size, bool isInv, const double* h_params, double* h_Tparams) { int N = order.complexity(); - auto allocator = handle.getDeviceAllocator(); - auto stream = handle.getStream(); + auto allocator = handle.get_device_allocator(); + auto stream = handle.get_stream(); double* d_params = (double*)allocator->allocate(N * batch_size * sizeof(double), stream); double* d_Tparams = @@ -695,7 +1020,7 @@ void batched_jones_transform(cumlHandle& handle, const ARIMAOrder& order, params.allocate(order, batch_size, allocator, stream, false); Tparams.allocate(order, batch_size, allocator, stream, true); - MLCommon::updateDevice(d_params, h_params, N * batch_size, stream); + raft::update_device(d_params, h_params, N * batch_size, stream); params.unpack(order, batch_size, d_params, stream); @@ -705,7 
+1030,7 @@ void batched_jones_transform(cumlHandle& handle, const ARIMAOrder& order, Tparams.pack(order, batch_size, d_Tparams, stream); - MLCommon::updateHost(h_Tparams, d_Tparams, N * batch_size, stream); + raft::update_host(h_Tparams, d_Tparams, N * batch_size, stream); allocator->deallocate(d_params, N * batch_size * sizeof(double), stream); allocator->deallocate(d_Tparams, N * batch_size * sizeof(double), stream); diff --git a/cpp/src/common/allocatorAdapter.hpp b/cpp/src/common/allocatorAdapter.hpp index 302da642a2..06aaf879d7 100644 --- a/cpp/src/common/allocatorAdapter.hpp +++ b/cpp/src/common/allocatorAdapter.hpp @@ -22,7 +22,7 @@ #include -#include +#include #include namespace ML { @@ -95,9 +95,9 @@ class stdAllocatorAdapter { /** * @todo: Complete doxygen documentation * @code{.cpp} - * void foo( const cumlHandle_impl& h, ... , cudaStream_t stream ) + * void foo( const raft::handle_t& h, ... , cudaStream_t stream ) * { - * auto execution_policy = ML::thrust_exec_policy(h.getDeviceAllocator(),stream); + * auto execution_policy = ML::thrust_exec_policy(h.get_device_allocator(),stream); * thrust::for_each(execution_policy->on(stream), ... ); * } * @endcode diff --git a/cpp/src/common/cuML_comms_impl.cpp b/cpp/src/common/cuML_comms_impl.cpp deleted file mode 100644 index 41bb9e0dc2..0000000000 --- a/cpp/src/common/cuML_comms_impl.cpp +++ /dev/null @@ -1,137 +0,0 @@ -/* - * Copyright (c) 2019-2020, NVIDIA CORPORATION. - * - * Licensed under the Apache License, Version 2.0 (the "License"); - * you may not use this file except in compliance with the License. - * You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. 
- */ - -#include - -#include -#include -#include - -namespace MLCommon { - -cumlCommunicator::cumlCommunicator(std::unique_ptr impl) - : _impl(impl.release()) { - ASSERT(nullptr != _impl.get(), "ERROR: Invalid cumlCommunicator_iface used!"); -} - -int cumlCommunicator::getSize() const { return _impl->getSize(); } - -int cumlCommunicator::getRank() const { return _impl->getRank(); } - -cumlCommunicator cumlCommunicator::commSplit(int color, int key) const { - return cumlCommunicator(_impl->commSplit(color, key)); -} - -void cumlCommunicator::barrier() const { _impl->barrier(); } - -cumlCommunicator::status_t cumlCommunicator::syncStream( - cudaStream_t stream) const { - return _impl->syncStream(stream); -} - -void cumlCommunicator::isend(const void* buf, int size, int dest, int tag, - request_t* request) const { - _impl->isend(buf, size, dest, tag, request); -} - -void cumlCommunicator::irecv(void* buf, int size, int source, int tag, - request_t* request) const { - _impl->irecv(buf, size, source, tag, request); -} - -void cumlCommunicator::waitall(int count, request_t array_of_requests[]) const { - _impl->waitall(count, array_of_requests); -} - -void cumlCommunicator::allreduce(const void* sendbuff, void* recvbuff, - int count, datatype_t datatype, op_t op, - cudaStream_t stream) const { - _impl->allreduce(sendbuff, recvbuff, count, datatype, op, stream); -} - -void cumlCommunicator::bcast(void* buff, int count, datatype_t datatype, - int root, cudaStream_t stream) const { - _impl->bcast(buff, count, datatype, root, stream); -} - -void cumlCommunicator::reduce(const void* sendbuff, void* recvbuff, int count, - datatype_t datatype, op_t op, int root, - cudaStream_t stream) const { - _impl->reduce(sendbuff, recvbuff, count, datatype, op, root, stream); -} - -void cumlCommunicator::allgather(const void* sendbuff, void* recvbuff, - int sendcount, datatype_t datatype, - cudaStream_t stream) const { - _impl->allgather(sendbuff, recvbuff, sendcount, datatype, stream); -} - -void cumlCommunicator::allgatherv(const void* sendbuf, void* recvbuf, - const int recvcounts[], const int displs[], - datatype_t datatype, - cudaStream_t stream) const { - _impl->allgatherv(sendbuf, recvbuf, recvcounts, displs, datatype, stream); -} - -void cumlCommunicator::reducescatter(const void* sendbuff, void* recvbuff, - int recvcount, datatype_t datatype, - op_t op, cudaStream_t stream) const { - _impl->reducescatter(sendbuff, recvbuff, recvcount, datatype, op, stream); -} - -template <> -cumlCommunicator::datatype_t cumlCommunicator::getDataType() const { - return cumlCommunicator::CHAR; -} - -template <> -cumlCommunicator::datatype_t cumlCommunicator::getDataType() const { - return cumlCommunicator::UINT8; -} - -template <> -cumlCommunicator::datatype_t cumlCommunicator::getDataType() const { - return cumlCommunicator::INT; -} - -template <> -cumlCommunicator::datatype_t cumlCommunicator::getDataType() const { - return cumlCommunicator::UINT; -} - -template <> -cumlCommunicator::datatype_t cumlCommunicator::getDataType() const { - return cumlCommunicator::INT64; -} - -template <> -cumlCommunicator::datatype_t cumlCommunicator::getDataType() const { - return cumlCommunicator::UINT64; -} - -template <> -cumlCommunicator::datatype_t cumlCommunicator::getDataType() const { - return cumlCommunicator::FLOAT; -} - -template <> -cumlCommunicator::datatype_t cumlCommunicator::getDataType() const { - return cumlCommunicator::DOUBLE; -} - -cumlCommunicator_iface::~cumlCommunicator_iface() {} - -} // namespace MLCommon diff --git 
a/cpp/src/common/cumlHandle.cpp b/cpp/src/common/cumlHandle.cpp index c0b1bfe50e..c4697c14b5 100644 --- a/cpp/src/common/cumlHandle.cpp +++ b/cpp/src/common/cumlHandle.cpp @@ -15,235 +15,22 @@ */ #include "cumlHandle.hpp" -#include -#include -#include -#include +#include +#include +#include +#include #include #include namespace ML { -int cumlHandle::getDefaultNumInternalStreams() { - return _default_num_internal_streams; -} - -cumlHandle::cumlHandle(int n_streams) : _impl(new cumlHandle_impl(n_streams)) {} -cumlHandle::cumlHandle() : _impl(new cumlHandle_impl()) {} -cumlHandle::~cumlHandle() {} - -void cumlHandle::setStream(cudaStream_t stream) { _impl->setStream(stream); } - -cudaStream_t cumlHandle::getStream() const { return _impl->getStream(); } - -const cudaDeviceProp& cumlHandle::getDeviceProperties() const { - return _impl->getDeviceProperties(); -} - -std::vector cumlHandle::getInternalStreams() const { - return _impl->getInternalStreams(); -} - -void cumlHandle::setDeviceAllocator( - std::shared_ptr allocator) { - _impl->setDeviceAllocator(allocator); -} - -std::shared_ptr cumlHandle::getDeviceAllocator() const { - return _impl->getDeviceAllocator(); -} - -void cumlHandle::setHostAllocator(std::shared_ptr allocator) { - _impl->setHostAllocator(allocator); -} - -std::shared_ptr cumlHandle::getHostAllocator() const { - return _impl->getHostAllocator(); -} -int cumlHandle::getNumInternalStreams() { - return _impl->getNumInternalStreams(); -} -const cumlHandle_impl& cumlHandle::getImpl() const { return *_impl.get(); } - -cumlHandle_impl& cumlHandle::getImpl() { return *_impl.get(); } - -using MLCommon::defaultDeviceAllocator; -using MLCommon::defaultHostAllocator; - -cumlHandle_impl::cumlHandle_impl(int n_streams) - : _cublas_handle(), - _cusolverDn_handle(), - _cusolverSp_handle(), - _cusparse_handle(), - _userStream(nullptr), - _event(), - _deviceAllocator(std::make_shared()), - _hostAllocator(std::make_shared()), - _communicator(), - _streams(), - _prop(), - _dev_id([]() -> int { - int cur_dev = -1; - CUDA_CHECK(cudaGetDevice(&cur_dev)); - return cur_dev; - }()), - _num_streams(n_streams), - _cublasInitialized(false), - _cusolverDnInitialized(false), - _cusolverSpInitialized(false), - _cusparseInitialized(false), - _devicePropInitialized(false) { - createResources(); -} - -cumlHandle_impl::~cumlHandle_impl() { destroyResources(); } - -int cumlHandle_impl::getDevice() const { return _dev_id; } - -void cumlHandle_impl::setStream(cudaStream_t stream) { _userStream = stream; } - -cudaStream_t cumlHandle_impl::getStream() const { return _userStream; } - -const cudaDeviceProp& cumlHandle_impl::getDeviceProperties() const { - if (!_devicePropInitialized) { - CUDA_CHECK(cudaGetDeviceProperties(&_prop, _dev_id)); - _devicePropInitialized = true; - } - return _prop; -} - -void cumlHandle_impl::setDeviceAllocator( - std::shared_ptr allocator) { - _deviceAllocator = allocator; -} - -std::shared_ptr cumlHandle_impl::getDeviceAllocator() const { - return _deviceAllocator; -} - -void cumlHandle_impl::setHostAllocator( - std::shared_ptr allocator) { - _hostAllocator = allocator; -} - -std::shared_ptr cumlHandle_impl::getHostAllocator() const { - return _hostAllocator; -} - -cublasHandle_t cumlHandle_impl::getCublasHandle() const { - if (!_cublasInitialized) { - CUBLAS_CHECK(cublasCreate(&_cublas_handle)); - _cublasInitialized = true; - } - return _cublas_handle; -} - -cusolverDnHandle_t cumlHandle_impl::getcusolverDnHandle() const { - if (!_cusolverDnInitialized) { - 
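[Editor's note] The deleted cumlHandle/cumlHandle_impl accessors map one-to-one onto raft::handle_t. A sketch of the replacement pattern, using only raft::handle_t calls that appear elsewhere in this diff (the include path is an assumption):

```cpp
#include <raft/handle.hpp>  // assumed include path for raft::handle_t

// Replacement pattern for the deleted cumlHandle accessors above.
void use_handle(raft::handle_t& handle, cudaStream_t user_stream) {
  handle.set_stream(user_stream);                           // was setStream
  cudaStream_t stream = handle.get_stream();                // was getStream
  cublasHandle_t cublas = handle.get_cublas_handle();       // was getCublasHandle
  cusolverDnHandle_t dn = handle.get_cusolver_dn_handle();  // was getcusolverDnHandle
  cusolverSpHandle_t sp = handle.get_cusolver_sp_handle();  // was getcusolverSpHandle
  auto dev_alloc = handle.get_device_allocator();           // was getDeviceAllocator
  auto host_alloc = handle.get_host_allocator();            // was getHostAllocator
  (void)stream; (void)cublas; (void)dn; (void)sp;
  (void)dev_alloc; (void)host_alloc;
}
```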
CUSOLVER_CHECK(cusolverDnCreate(&_cusolverDn_handle)); - _cusolverDnInitialized = true; - } - return _cusolverDn_handle; -} - -cusolverSpHandle_t cumlHandle_impl::getcusolverSpHandle() const { - if (!_cusolverSpInitialized) { - CUSOLVER_CHECK(cusolverSpCreate(&_cusolverSp_handle)); - _cusolverSpInitialized = true; - } - return _cusolverSp_handle; -} - -cusparseHandle_t cumlHandle_impl::getcusparseHandle() const { - if (!_cusparseInitialized) { - CUSPARSE_CHECK(cusparseCreate(&_cusparse_handle)); - _cusparseInitialized = true; - } - return _cusparse_handle; -} - -cudaStream_t cumlHandle_impl::getInternalStream(int sid) const { - return _streams[sid]; -} - -int cumlHandle_impl::getNumInternalStreams() const { return _num_streams; } - -std::vector cumlHandle_impl::getInternalStreams() const { - std::vector int_streams_vec(_num_streams); - for (auto s : _streams) { - int_streams_vec.push_back(s); - } - return int_streams_vec; -} - -void cumlHandle_impl::waitOnUserStream() const { - CUDA_CHECK(cudaEventRecord(_event, _userStream)); - for (auto s : _streams) { - CUDA_CHECK(cudaStreamWaitEvent(s, _event, 0)); - } -} - -void cumlHandle_impl::waitOnInternalStreams() const { - for (auto s : _streams) { - CUDA_CHECK(cudaEventRecord(_event, s)); - CUDA_CHECK(cudaStreamWaitEvent(_userStream, _event, 0)); - } -} - -void cumlHandle_impl::setCommunicator( - std::shared_ptr communicator) { - _communicator = communicator; -} - -const MLCommon::cumlCommunicator& cumlHandle_impl::getCommunicator() const { - ASSERT(nullptr != _communicator.get(), - "ERROR: Communicator was not initialized\n"); - return *_communicator; -} - -bool cumlHandle_impl::commsInitialized() const { - return (nullptr != _communicator.get()); -} - -void cumlHandle_impl::createResources() { - cudaStream_t stream; - CUDA_CHECK(cudaStreamCreateWithFlags(&stream, cudaStreamNonBlocking)); - _streams.push_back(stream); - for (int i = 1; i < _num_streams; ++i) { - cudaStream_t stream; - CUDA_CHECK(cudaStreamCreateWithFlags(&stream, cudaStreamNonBlocking)); - _streams.push_back(stream); - } - CUDA_CHECK(cudaEventCreateWithFlags(&_event, cudaEventDisableTiming)); -} - -void cumlHandle_impl::destroyResources() { - if (_cusparseInitialized) { - CUSPARSE_CHECK_NO_THROW(cusparseDestroy(_cusparse_handle)); - } - if (_cusolverDnInitialized) { - CUSOLVER_CHECK_NO_THROW(cusolverDnDestroy(_cusolverDn_handle)); - } - if (_cusolverSpInitialized) { - CUSOLVER_CHECK_NO_THROW(cusolverSpDestroy(_cusolverSp_handle)); - } - if (_cublasInitialized) { - CUBLAS_CHECK_NO_THROW(cublasDestroy(_cublas_handle)); - } - while (!_streams.empty()) { - CUDA_CHECK_NO_THROW(cudaStreamDestroy(_streams.back())); - _streams.pop_back(); - } - CUDA_CHECK_NO_THROW(cudaEventDestroy(_event)); -} - HandleMap handleMap; std::pair HandleMap::createAndInsertHandle() { cumlError_t status = CUML_SUCCESS; cumlHandle_t chosen_handle; try { - auto handle_ptr = new ML::cumlHandle(); + auto handle_ptr = new raft::handle_t(); bool inserted; { std::lock_guard guard(_mapMutex); @@ -274,19 +61,20 @@ std::pair HandleMap::createAndInsertHandle() { return std::pair(chosen_handle, status); } -std::pair HandleMap::lookupHandlePointer( +std::pair HandleMap::lookupHandlePointer( cumlHandle_t handle) const { std::lock_guard guard(_mapMutex); auto it = _handleMap.find(handle); if (it == _handleMap.end()) { - return std::pair(nullptr, CUML_INVALID_HANDLE); + return std::pair(nullptr, + CUML_INVALID_HANDLE); } else { - return std::pair(it->second, CUML_SUCCESS); + return std::pair(it->second, CUML_SUCCESS); } } 
cumlError_t HandleMap::removeAndDestroyHandle(cumlHandle_t handle) { - ML::cumlHandle* handle_ptr; + raft::handle_t* handle_ptr; { std::lock_guard guard(_mapMutex); auto it = _handleMap.find(handle); diff --git a/cpp/src/common/cumlHandle.hpp b/cpp/src/common/cumlHandle.hpp index 0b732f4ccd..864d2c8cc2 100644 --- a/cpp/src/common/cumlHandle.hpp +++ b/cpp/src/common/cumlHandle.hpp @@ -26,10 +26,11 @@ #include #include -#include +#include +#include #include -#include +#include #include @@ -38,65 +39,6 @@ namespace ML { using MLCommon::deviceAllocator; using MLCommon::hostAllocator; -/** - * @todo: Add doxygen documentation - */ -class cumlHandle_impl { - public: - cumlHandle_impl(int n_streams = cumlHandle::getDefaultNumInternalStreams()); - ~cumlHandle_impl(); - int getDevice() const; - void setStream(cudaStream_t stream); - cudaStream_t getStream() const; - void setDeviceAllocator(std::shared_ptr allocator); - std::shared_ptr getDeviceAllocator() const; - void setHostAllocator(std::shared_ptr allocator); - std::shared_ptr getHostAllocator() const; - - cublasHandle_t getCublasHandle() const; - cusolverDnHandle_t getcusolverDnHandle() const; - cusolverSpHandle_t getcusolverSpHandle() const; - cusparseHandle_t getcusparseHandle() const; - - cudaStream_t getInternalStream(int sid) const; - int getNumInternalStreams() const; - - std::vector getInternalStreams() const; - - void waitOnUserStream() const; - void waitOnInternalStreams() const; - - void setCommunicator( - std::shared_ptr communicator); - const MLCommon::cumlCommunicator& getCommunicator() const; - bool commsInitialized() const; - - const cudaDeviceProp& getDeviceProperties() const; - - private: - mutable cublasHandle_t _cublas_handle; - mutable cusolverDnHandle_t _cusolverDn_handle; - mutable cusolverSpHandle_t _cusolverSp_handle; - mutable cusparseHandle_t _cusparse_handle; - cudaStream_t _userStream; - cudaEvent_t _event; - std::shared_ptr _deviceAllocator; - std::shared_ptr _hostAllocator; - std::shared_ptr _communicator; - std::vector _streams; - mutable cudaDeviceProp _prop; - const int _dev_id; - const int _num_streams; - mutable bool _cublasInitialized; - mutable bool _cusolverDnInitialized; - mutable bool _cusolverSpInitialized; - mutable bool _cusparseInitialized; - mutable bool _devicePropInitialized; - - void createResources(); - void destroyResources(); -}; - /** * Map from integral cumlHandle_t identifiers to cumlHandle pointer protected * by a mutex for thread-safe access. @@ -118,7 +60,7 @@ class HandleMap { * the handle is INVALID_HANDLE. Error code CUML_INAVLID_HANDLE * is returned if the provided `handle` is invald. 
*/ - std::pair lookupHandlePointer( + std::pair lookupHandlePointer( cumlHandle_t handle) const; /** @@ -134,7 +76,7 @@ class HandleMap { -1; //!< sentinel value for invalid ID private: - std::unordered_map + std::unordered_map _handleMap; //!< map from ID to pointer mutable std::mutex _mapMutex; //!< mutex protecting the map cumlHandle_t _nextHandle; //!< value of next handle ID @@ -143,25 +85,4 @@ class HandleMap { /// Static handle map instance (see cumlHandle.cpp) extern HandleMap handleMap; -namespace detail { - -/** - * @todo: Add doxygen documentation - */ -class streamSyncer { - public: - streamSyncer(const cumlHandle_impl& handle) : _handle(handle) { - _handle.waitOnUserStream(); - } - ~streamSyncer() { _handle.waitOnInternalStreams(); } - - streamSyncer(const streamSyncer& other) = delete; - streamSyncer& operator=(const streamSyncer& other) = delete; - - private: - const cumlHandle_impl& _handle; -}; - -} // end namespace detail - } // end namespace ML diff --git a/cpp/src/common/cuml_api.cpp b/cpp/src/common/cuml_api.cpp index d41e721951..fb66ff78ab 100644 --- a/cpp/src/common/cuml_api.cpp +++ b/cpp/src/common/cuml_api.cpp @@ -14,16 +14,18 @@ * limitations under the License. */ -#include #include +#include #include #include +#include +#include #include "cumlHandle.hpp" namespace ML { namespace detail { -class hostAllocatorFunctionWrapper : public MLCommon::hostAllocator { +class hostAllocatorFunctionWrapper : public raft::mr::host::allocator { public: hostAllocatorFunctionWrapper(cuml_allocate allocate_fn, cuml_deallocate deallocate_fn) @@ -44,7 +46,8 @@ class hostAllocatorFunctionWrapper : public MLCommon::hostAllocator { const std::function _deallocate_fn; }; -class deviceAllocatorFunctionWrapper : public MLCommon::deviceAllocator { +class deviceAllocatorFunctionWrapper + : public raft::mr::device::default_allocator { public: deviceAllocatorFunctionWrapper(cuml_allocate allocate_fn, cuml_deallocate deallocate_fn) @@ -87,11 +90,11 @@ extern "C" cumlError_t cumlCreate(cumlHandle_t* handle) { extern "C" cumlError_t cumlSetStream(cumlHandle_t handle, cudaStream_t stream) { cumlError_t status; - ML::cumlHandle* handle_ptr; + raft::handle_t* handle_ptr; std::tie(handle_ptr, status) = ML::handleMap.lookupHandlePointer(handle); if (status == CUML_SUCCESS) { try { - handle_ptr->setStream(stream); + handle_ptr->set_stream(stream); } //TODO: Implement this //catch (const MLCommon::Exception& e) @@ -109,11 +112,11 @@ extern "C" cumlError_t cumlSetStream(cumlHandle_t handle, cudaStream_t stream) { extern "C" cumlError_t cumlGetStream(cumlHandle_t handle, cudaStream_t* stream) { cumlError_t status; - ML::cumlHandle* handle_ptr; + raft::handle_t* handle_ptr; std::tie(handle_ptr, status) = ML::handleMap.lookupHandlePointer(handle); if (status == CUML_SUCCESS) { try { - *stream = handle_ptr->getStream(); + *stream = handle_ptr->get_stream(); } //TODO: Implement this //catch (const MLCommon::Exception& e) @@ -132,14 +135,14 @@ extern "C" cumlError_t cumlSetDeviceAllocator(cumlHandle_t handle, cuml_allocate allocate_fn, cuml_deallocate deallocate_fn) { cumlError_t status; - ML::cumlHandle* handle_ptr; + raft::handle_t* handle_ptr; std::tie(handle_ptr, status) = ML::handleMap.lookupHandlePointer(handle); if (status == CUML_SUCCESS) { try { std::shared_ptr allocator( new ML::detail::deviceAllocatorFunctionWrapper(allocate_fn, deallocate_fn)); - handle_ptr->setDeviceAllocator(allocator); + handle_ptr->set_device_allocator(allocator); } //TODO: Implement this //catch (const MLCommon::Exception& e) 
@@ -158,14 +161,14 @@ extern "C" cumlError_t cumlSetHostAllocator(cumlHandle_t handle, cuml_allocate allocate_fn, cuml_deallocate deallocate_fn) { cumlError_t status; - ML::cumlHandle* handle_ptr; + raft::handle_t* handle_ptr; std::tie(handle_ptr, status) = ML::handleMap.lookupHandlePointer(handle); if (status == CUML_SUCCESS) { try { std::shared_ptr allocator( new ML::detail::hostAllocatorFunctionWrapper(allocate_fn, deallocate_fn)); - handle_ptr->setHostAllocator(allocator); + handle_ptr->set_host_allocator(allocator); } //TODO: Implement this //catch (const MLCommon::Exception& e) diff --git a/cpp/src/common/logger.cpp b/cpp/src/common/logger.cpp index 8da2c5384b..2a01754e2b 100644 --- a/cpp/src/common/logger.cpp +++ b/cpp/src/common/logger.cpp @@ -18,7 +18,9 @@ #include // NOLINT #include +#include #include +#include namespace ML { @@ -48,7 +50,10 @@ Logger& Logger::get() { return logger; } -Logger::Logger() : logger{spdlog::stdout_color_mt("cuml")}, currPattern() { +Logger::Logger() + : sink{std::make_shared()}, + logger{std::make_shared("cuml", sink)}, + currPattern() { setPattern(DefaultPattern); setLevel(CUML_LEVEL_INFO); } @@ -63,6 +68,12 @@ void Logger::setPattern(const std::string& pattern) { logger->set_pattern(pattern); } +void Logger::setCallback(spdlog::sinks::LogCallback callback) { + sink->set_callback(callback); +} + +void Logger::setFlush(void (*flush)()) { sink->set_flush(flush); } + bool Logger::shouldLogFor(int level) const { level = convert_level_to_spdlog(level); auto level_e = static_cast(level); @@ -87,6 +98,8 @@ void Logger::log(int level, const char* fmt, ...) { } } +void Logger::flush() { logger->flush(); } + PatternSetter::PatternSetter(const std::string& pattern) : prevPattern() { prevPattern = Logger::get().getPattern(); Logger::get().setPattern(pattern); diff --git a/cpp/src/common/nvtx.cu b/cpp/src/common/nvtx.cu index a2f8fb9be9..a1cad6420c 100644 --- a/cpp/src/common/nvtx.cu +++ b/cpp/src/common/nvtx.cu @@ -54,7 +54,7 @@ uint32_t hsv2rgb(float h, float s, float v) { } // convert hue from [0, 1] range to [0, 360] float h_deg = h * 360.f; - if (0.f < h_deg || h_deg >= 360.f) h_deg = 0.f; + if (0.f > h_deg || h_deg >= 360.f) h_deg = 0.f; h_deg /= 60.f; int h_range = (int)h_deg; float h_mod = h_deg - h_range; @@ -147,16 +147,7 @@ void PUSH_RANGE(const char *name) { nvtxRangePushEx(&eventAttrib); } -#ifdef ENABLE_EMPTY_MARKER_KERNEL -__global__ void emptyMarkerKernel() {} -#endif // ENABLE_EMPTY_MARKER_KERNEL - -void POP_RANGE() { - nvtxRangePop(); -#ifdef ENABLE_EMPTY_MARKER_KERNEL - emptyMarkerKernel<<<1, 1>>>(); -#endif -} +void POP_RANGE() { nvtxRangePop(); } #else // NVTX_ENABLED diff --git a/cpp/src/common/tensor.hpp b/cpp/src/common/tensor.hpp index 18fe6d2b0f..5646d421c2 100644 --- a/cpp/src/common/tensor.hpp +++ b/cpp/src/common/tensor.hpp @@ -16,6 +16,7 @@ #pragma once +#include #include #include diff --git a/cpp/src/comms/cuML_comms_test.cpp b/cpp/src/comms/cuML_comms_test.cpp deleted file mode 100644 index a15cc77647..0000000000 --- a/cpp/src/comms/cuML_comms_test.cpp +++ /dev/null @@ -1,175 +0,0 @@ -/* - * Copyright (c) 2019, NVIDIA CORPORATION. - * - * Licensed under the Apache License, Version 2.0 (the "License"); - * you may not use this file except in compliance with the License. 
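[Editor's note] The one-character nvtx.cu change above fixes an inverted comparison: the old guard `0.f < h_deg || h_deg >= 360.f` was true for every positive hue, so hsv2rgb zeroed almost all hues before conversion. Isolated for clarity:

```cpp
// Corrected out-of-range guard from hsv2rgb: clamp only hues outside [0, 360).
float clamp_hue_deg(float h /* hue in [0, 1] */) {
  float h_deg = h * 360.f;
  if (0.f > h_deg || h_deg >= 360.f) h_deg = 0.f;  // was: 0.f < h_deg || ...
  return h_deg;
}
```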
- * You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. - */ - -#include "cuML_comms_test.hpp" - -#include -#include -#include -#include - -namespace ML { -namespace Comms { - -bool test_collective_allreduce(const ML::cumlHandle& h) { - const cumlHandle_impl& handle = h.getImpl(); - ML::detail::streamSyncer _(handle); - const MLCommon::cumlCommunicator& communicator = handle.getCommunicator(); - - const int send = 1; - - cudaStream_t stream = handle.getStream(); - - MLCommon::device_buffer temp_d(handle.getDeviceAllocator(), stream); - temp_d.resize(1, stream); - CUDA_CHECK(cudaMemcpyAsync(temp_d.data(), &send, sizeof(int), - cudaMemcpyHostToDevice, stream)); - communicator.allreduce(temp_d.data(), temp_d.data(), 1, - MLCommon::cumlCommunicator::SUM, stream); - int temp_h = 0; - CUDA_CHECK(cudaMemcpyAsync(&temp_h, temp_d.data(), sizeof(int), - cudaMemcpyDeviceToHost, stream)); - CUDA_CHECK(cudaStreamSynchronize(stream)); - communicator.barrier(); - - std::cout << "Clique size: " << communicator.getSize() << std::endl; - std::cout << "final_size: " << temp_h << std::endl; - - return temp_h == communicator.getSize(); -} - -bool test_pointToPoint_simple_send_recv(const ML::cumlHandle& h, - int numTrials) { - const cumlHandle_impl& handle = h.getImpl(); - const MLCommon::cumlCommunicator& communicator = handle.getCommunicator(); - const int rank = communicator.getRank(); - - bool ret = true; - for (int i = 0; i < numTrials; i++) { - std::vector received_data((communicator.getSize() - 1), -1); - - std::vector requests; - requests.resize(2 * (communicator.getSize() - 1)); - int request_idx = 0; - //post receives - for (int r = 0; r < communicator.getSize(); ++r) { - if (r != rank) { - communicator.irecv(received_data.data() + request_idx, 1, r, 0, - requests.data() + request_idx); - ++request_idx; - } - } - - for (int r = 0; r < communicator.getSize(); ++r) { - if (r != rank) { - communicator.isend(&rank, 1, r, 0, requests.data() + request_idx); - ++request_idx; - } - } - - communicator.waitall(requests.size(), requests.data()); - communicator.barrier(); - - if (communicator.getRank() == 0) { - std::cout << "=========================" << std::endl; - std::cout << "Trial " << i << std::endl; - } - - for (int printrank = 0; printrank < communicator.getSize(); ++printrank) { - if (communicator.getRank() == printrank) { - std::cout << "Rank " << communicator.getRank() << " received: ["; - for (int i = 0; i < received_data.size(); i++) { - auto rec = received_data[i]; - std::cout << rec; - if (rec == -1) ret = false; - communicator.barrier(); - if (i < received_data.size() - 1) std::cout << ", "; - } - std::cout << "]" << std::endl; - } - - communicator.barrier(); - } - - if (communicator.getRank() == 0) - std::cout << "=========================" << std::endl; - } - - return ret; -} - -bool test_pointToPoint_recv_any_rank(const ML::cumlHandle& h, int numTrials) { - const cumlHandle_impl& handle = h.getImpl(); - const MLCommon::cumlCommunicator& communicator = handle.getCommunicator(); - const int rank = communicator.getRank(); - - bool ret = true; - for (int i = 0; i < numTrials; i++) { - std::vector received_data((communicator.getSize() 
- 1), -1); - - std::vector requests; - requests.resize(2 * (communicator.getSize() - 1)); - int request_idx = 0; - //post receives - for (int r = 0; r < communicator.getSize(); ++r) { - if (r != rank) { - communicator.irecv(received_data.data() + request_idx, 1, - MLCommon::cumlCommunicator::CUML_ANY_SOURCE, 0, - requests.data() + request_idx); - ++request_idx; - } - } - - for (int r = 0; r < communicator.getSize(); ++r) { - if (r != rank) { - communicator.isend(&rank, 1, r, 0, requests.data() + request_idx); - ++request_idx; - } - } - - std::cout << "Waiting..." << std::endl; - communicator.waitall(requests.size(), requests.data()); - communicator.barrier(); - - if (communicator.getRank() == 0) { - std::cout << "=========================" << std::endl; - std::cout << "Trial " << i << std::endl; - } - - for (int printrank = 0; printrank < communicator.getSize(); ++printrank) { - if (communicator.getRank() == printrank) { - std::cout << "Rank " << communicator.getRank() << " received: ["; - for (int i = 0; i < received_data.size(); i++) { - auto rec = received_data[i]; - std::cout << rec; - if (rec == -1) ret = false; - communicator.barrier(); - if (i < received_data.size() - 1) std::cout << ", "; - } - std::cout << "]" << std::endl; - } - communicator.barrier(); - } - - if (communicator.getRank() == 0) - std::cout << "=========================" << std::endl; - } - - return ret; -} - -}; // namespace Comms -}; // end namespace ML diff --git a/cpp/src/comms/cuML_comms_test.hpp b/cpp/src/comms/cuML_comms_test.hpp deleted file mode 100644 index f363d5d47c..0000000000 --- a/cpp/src/comms/cuML_comms_test.hpp +++ /dev/null @@ -1,44 +0,0 @@ -/* - * Copyright (c) 2019, NVIDIA CORPORATION. - * - * Licensed under the Apache License, Version 2.0 (the "License"); - * you may not use this file except in compliance with the License. - * You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. - */ - -#pragma once - -#include - -namespace ML { -namespace Comms { - -/** - * @brief Simple allreduce test for single integer value of 1. Each rank - * evaluates whether their allreduced value equals the size of the clique. - * @param[in] handle cumlHandle instance with initialized cumlCommunicator - */ -bool test_collective_allreduce(const ML::cumlHandle& handle); - -/** - * @brief Simple point-to-point test. Each rank passes its rank to all other - * ranks and verifies that it received messages from all other ranks. 
- * @param[in] handle cumlHandle instance with initialized cumlCommunicator - * @param[in] n_trials number of iterations to pass messages - */ -bool test_pointToPoint_simple_send_recv(const ML::cumlHandle& handle, - int n_trials); - -bool test_pointToPoint_recv_any_rank(const ML::cumlHandle& handle, - int numTrials); - -}; // namespace Comms -}; // end namespace ML diff --git a/cpp/src/datasets/make_arima.cu b/cpp/src/datasets/make_arima.cu index dac40d1fb4..f309828d4b 100644 --- a/cpp/src/datasets/make_arima.cu +++ b/cpp/src/datasets/make_arima.cu @@ -22,25 +22,25 @@ namespace ML { namespace Datasets { template -inline void make_arima_helper(const cumlHandle& handle, DataT* out, +inline void make_arima_helper(const raft::handle_t& handle, DataT* out, IdxT batch_size, IdxT n_obs, ARIMAOrder order, DataT scale, DataT noise_scale, DataT intercept_scale, uint64_t seed) { - auto stream = handle.getStream(); - auto allocator = handle.getImpl().getDeviceAllocator(); + auto stream = handle.get_stream(); + auto allocator = handle.get_device_allocator(); MLCommon::Random::make_arima(out, batch_size, n_obs, order, allocator, stream, scale, noise_scale, intercept_scale, seed); } -void make_arima(const cumlHandle& handle, float* out, int batch_size, int n_obs, - ARIMAOrder order, float scale, float noise_scale, +void make_arima(const raft::handle_t& handle, float* out, int batch_size, + int n_obs, ARIMAOrder order, float scale, float noise_scale, float intercept_scale, uint64_t seed) { make_arima_helper(handle, out, batch_size, n_obs, order, scale, noise_scale, intercept_scale, seed); } -void make_arima(const cumlHandle& handle, double* out, int batch_size, +void make_arima(const raft::handle_t& handle, double* out, int batch_size, int n_obs, ARIMAOrder order, double scale, double noise_scale, double intercept_scale, uint64_t seed) { make_arima_helper(handle, out, batch_size, n_obs, order, scale, noise_scale, diff --git a/cpp/src/datasets/make_blobs.cu b/cpp/src/datasets/make_blobs.cu index 0c3713a0ee..8059d2bd89 100644 --- a/cpp/src/datasets/make_blobs.cu +++ b/cpp/src/datasets/make_blobs.cu @@ -21,48 +21,48 @@ namespace ML { namespace Datasets { -void make_blobs(const cumlHandle& handle, float* out, int64_t* labels, +void make_blobs(const raft::handle_t& handle, float* out, int64_t* labels, int64_t n_rows, int64_t n_cols, int64_t n_clusters, bool row_major, const float* centers, const float* cluster_std, const float cluster_std_scalar, bool shuffle, float center_box_min, float center_box_max, uint64_t seed) { MLCommon::Random::make_blobs( - out, labels, n_rows, n_cols, n_clusters, handle.getDeviceAllocator(), - handle.getStream(), row_major, centers, cluster_std, cluster_std_scalar, + out, labels, n_rows, n_cols, n_clusters, handle.get_device_allocator(), + handle.get_stream(), row_major, centers, cluster_std, cluster_std_scalar, shuffle, center_box_min, center_box_max, seed); } -void make_blobs(const cumlHandle& handle, double* out, int64_t* labels, +void make_blobs(const raft::handle_t& handle, double* out, int64_t* labels, int64_t n_rows, int64_t n_cols, int64_t n_clusters, bool row_major, const double* centers, const double* cluster_std, const double cluster_std_scalar, bool shuffle, double center_box_min, double center_box_max, uint64_t seed) { MLCommon::Random::make_blobs( - out, labels, n_rows, n_cols, n_clusters, handle.getDeviceAllocator(), - handle.getStream(), row_major, centers, cluster_std, cluster_std_scalar, + out, labels, n_rows, n_cols, n_clusters, handle.get_device_allocator(), + 
diff --git a/cpp/src/datasets/make_blobs.cu b/cpp/src/datasets/make_blobs.cu
index 0c3713a0ee..8059d2bd89 100644
--- a/cpp/src/datasets/make_blobs.cu
+++ b/cpp/src/datasets/make_blobs.cu
@@ -21,48 +21,48 @@ namespace ML {
 namespace Datasets {
 
-void make_blobs(const cumlHandle& handle, float* out, int64_t* labels,
+void make_blobs(const raft::handle_t& handle, float* out, int64_t* labels,
                 int64_t n_rows, int64_t n_cols, int64_t n_clusters,
                 bool row_major, const float* centers, const float* cluster_std,
                 const float cluster_std_scalar, bool shuffle,
                 float center_box_min, float center_box_max, uint64_t seed) {
   MLCommon::Random::make_blobs(
-    out, labels, n_rows, n_cols, n_clusters, handle.getDeviceAllocator(),
-    handle.getStream(), row_major, centers, cluster_std, cluster_std_scalar,
+    out, labels, n_rows, n_cols, n_clusters, handle.get_device_allocator(),
+    handle.get_stream(), row_major, centers, cluster_std, cluster_std_scalar,
     shuffle, center_box_min, center_box_max, seed);
 }
 
-void make_blobs(const cumlHandle& handle, double* out, int64_t* labels,
+void make_blobs(const raft::handle_t& handle, double* out, int64_t* labels,
                 int64_t n_rows, int64_t n_cols, int64_t n_clusters,
                 bool row_major, const double* centers,
                 const double* cluster_std, const double cluster_std_scalar,
                 bool shuffle, double center_box_min, double center_box_max,
                 uint64_t seed) {
   MLCommon::Random::make_blobs(
-    out, labels, n_rows, n_cols, n_clusters, handle.getDeviceAllocator(),
-    handle.getStream(), row_major, centers, cluster_std, cluster_std_scalar,
+    out, labels, n_rows, n_cols, n_clusters, handle.get_device_allocator(),
+    handle.get_stream(), row_major, centers, cluster_std, cluster_std_scalar,
     shuffle, center_box_min, center_box_max, seed);
 }
 
-void make_blobs(const cumlHandle& handle, float* out, int* labels, int n_rows,
-                int n_cols, int n_clusters, bool row_major,
+void make_blobs(const raft::handle_t& handle, float* out, int* labels,
+                int n_rows, int n_cols, int n_clusters, bool row_major,
                 const float* centers, const float* cluster_std,
                 const float cluster_std_scalar, bool shuffle,
                 float center_box_min, float center_box_max, uint64_t seed) {
   MLCommon::Random::make_blobs(
-    out, labels, n_rows, n_cols, n_clusters, handle.getDeviceAllocator(),
-    handle.getStream(), row_major, centers, cluster_std, cluster_std_scalar,
+    out, labels, n_rows, n_cols, n_clusters, handle.get_device_allocator(),
+    handle.get_stream(), row_major, centers, cluster_std, cluster_std_scalar,
     shuffle, center_box_min, center_box_max, seed);
 }
 
-void make_blobs(const cumlHandle& handle, double* out, int* labels, int n_rows,
-                int n_cols, int n_clusters, bool row_major,
+void make_blobs(const raft::handle_t& handle, double* out, int* labels,
+                int n_rows, int n_cols, int n_clusters, bool row_major,
                 const double* centers, const double* cluster_std,
                 const double cluster_std_scalar, bool shuffle,
                 double center_box_min, double center_box_max, uint64_t seed) {
   MLCommon::Random::make_blobs(
-    out, labels, n_rows, n_cols, n_clusters, handle.getDeviceAllocator(),
-    handle.getStream(), row_major, centers, cluster_std, cluster_std_scalar,
+    out, labels, n_rows, n_cols, n_clusters, handle.get_device_allocator(),
+    handle.get_stream(), row_major, centers, cluster_std, cluster_std_scalar,
     shuffle, center_box_min, center_box_max, seed);
 }
 }  // namespace Datasets
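Caller-side, only the handle type changes: the argument lists above are untouched. A minimal usage sketch for the `int`-indexed overload; the `cuml/datasets/make_blobs.hpp` header path is an assumption:

```cpp
#include <cuda_runtime.h>
#include <raft/handle.hpp>
#include <cuml/datasets/make_blobs.hpp>  // assumed public header

void blobs_example() {
  raft::handle_t handle;  // replaces ML::cumlHandle
  int n_rows = 1000, n_cols = 16, n_clusters = 4;

  float* out = nullptr;
  int* labels = nullptr;
  cudaMalloc(&out, sizeof(float) * n_rows * n_cols);
  cudaMalloc(&labels, sizeof(int) * n_rows);

  // Same argument list as before the port; only the handle type changed.
  ML::Datasets::make_blobs(handle, out, labels, n_rows, n_cols, n_clusters,
                           /*row_major=*/true, /*centers=*/nullptr,
                           /*cluster_std=*/nullptr,
                           /*cluster_std_scalar=*/1.0f, /*shuffle=*/true,
                           /*center_box_min=*/-10.0f,
                           /*center_box_max=*/10.0f, /*seed=*/42ULL);

  cudaFree(out);
  cudaFree(labels);
}
```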
diff --git a/cpp/src/datasets/make_regression.cu b/cpp/src/datasets/make_regression.cu
index 0fd600f2d2..82b30428cf 100644
--- a/cpp/src/datasets/make_regression.cu
+++ b/cpp/src/datasets/make_regression.cu
@@ -22,24 +22,24 @@ namespace ML {
 namespace Datasets {
 
 template <typename DataT, typename IdxT>
-void make_regression_helper(const cumlHandle& handle, DataT* out, DataT* values,
-                            IdxT n_rows, IdxT n_cols, IdxT n_informative,
-                            DataT* coef, IdxT n_targets, DataT bias,
-                            IdxT effective_rank, DataT tail_strength,
-                            DataT noise, bool shuffle, uint64_t seed) {
-  const auto& handle_impl = handle.getImpl();
-  cudaStream_t stream = handle_impl.getStream();
-  cublasHandle_t cublas_handle = handle_impl.getCublasHandle();
-  cusolverDnHandle_t cusolver_handle = handle_impl.getcusolverDnHandle();
-  auto allocator = handle_impl.getDeviceAllocator();
+void make_regression_helper(const raft::handle_t& handle, DataT* out,
+                            DataT* values, IdxT n_rows, IdxT n_cols,
+                            IdxT n_informative, DataT* coef, IdxT n_targets,
+                            DataT bias, IdxT effective_rank,
+                            DataT tail_strength, DataT noise, bool shuffle,
+                            uint64_t seed) {
+  const auto& handle_impl = handle;
+  cudaStream_t stream = handle_impl.get_stream();
+  cublasHandle_t cublas_handle = handle_impl.get_cublas_handle();
+  cusolverDnHandle_t cusolver_handle = handle_impl.get_cusolver_dn_handle();
+  auto allocator = handle_impl.get_device_allocator();
 
   MLCommon::Random::make_regression(
-    out, values, n_rows, n_cols, n_informative, cublas_handle, cusolver_handle,
-    allocator, stream, coef, n_targets, bias, effective_rank, tail_strength,
-    noise, shuffle, seed);
+    handle, out, values, n_rows, n_cols, n_informative, stream, coef, n_targets,
+    bias, effective_rank, tail_strength, noise, shuffle, seed);
 }
 
-void make_regression(const cumlHandle& handle, float* out, float* values,
+void make_regression(const raft::handle_t& handle, float* out, float* values,
                      int64_t n_rows, int64_t n_cols, int64_t n_informative,
                      float* coef, int64_t n_targets, float bias,
                      int64_t effective_rank, float tail_strength, float noise,
@@ -49,7 +49,7 @@ void make_regression(const cumlHandle& handle, float* out, float* values,
                      noise, shuffle, seed);
 }
 
-void make_regression(const cumlHandle& handle, double* out, double* values,
+void make_regression(const raft::handle_t& handle, double* out, double* values,
                      int64_t n_rows, int64_t n_cols, int64_t n_informative,
                      double* coef, int64_t n_targets, double bias,
                      int64_t effective_rank, double tail_strength, double noise,
@@ -59,7 +59,7 @@ void make_regression(const cumlHandle& handle, double* out, double* values,
                      noise, shuffle, seed);
 }
 
-void make_regression(const cumlHandle& handle, float* out, float* values,
+void make_regression(const raft::handle_t& handle, float* out, float* values,
                      int n_rows, int n_cols, int n_informative, float* coef,
                      int n_targets, float bias, int effective_rank,
                      float tail_strength, float noise, bool shuffle,
@@ -69,7 +69,7 @@ void make_regression(const cumlHandle& handle, float* out, float* values,
                      noise, shuffle, seed);
 }
 
-void make_regression(const cumlHandle& handle, double* out, double* values,
+void make_regression(const raft::handle_t& handle, double* out, double* values,
                      int n_rows, int n_cols, int n_informative, double* coef,
                      int n_targets, double bias, int effective_rank,
                      double tail_strength, double noise, bool shuffle,
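One substantive change in `make_regression.cu` goes beyond renaming: the prim now takes the handle itself and derives the cuBLAS/cuSOLVER resources and allocator internally, rather than receiving them piecemeal. Abridged from the hunk above:

```cpp
// Before the port: raw library handles and the allocator threaded through.
//   MLCommon::Random::make_regression(out, values, n_rows, n_cols,
//     n_informative, cublas_handle, cusolver_handle, allocator, stream,
//     coef, n_targets, bias, effective_rank, tail_strength, noise,
//     shuffle, seed);
//
// After the port: the handle carries those resources.
//   MLCommon::Random::make_regression(handle, out, values, n_rows, n_cols,
//     n_informative, stream, coef, n_targets, bias, effective_rank,
//     tail_strength, noise, shuffle, seed);
```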
diff --git a/cpp/src/dbscan/adjgraph/algo.cuh b/cpp/src/dbscan/adjgraph/algo.cuh
index a9082b6134..f80a827d7c 100644
--- a/cpp/src/dbscan/adjgraph/algo.cuh
+++ b/cpp/src/dbscan/adjgraph/algo.cuh
@@ -20,7 +20,7 @@
 #include
 #include
 #include
-#include
+#include
 #include "../common.cuh"
 #include "pack.h"
@@ -41,12 +41,12 @@ static const int TPB_X = 256;
 * CSR row_ind_ptr array (adj_graph) and filters into a core_pts array based on min_pts.
 */
 template <typename Index_>
-void launcher(const ML::cumlHandle_impl &handle, Pack<Index_> data,
-              Index_ batchSize, cudaStream_t stream) {
+void launcher(const raft::handle_t &handle, Pack<Index_> data, Index_ batchSize,
+              cudaStream_t stream) {
   device_ptr<Index_> dev_vd = device_pointer_cast(data.vd);
   device_ptr<Index_> dev_ex_scan = device_pointer_cast(data.ex_scan);
 
-  ML::thrustAllocatorAdapter alloc(handle.getDeviceAllocator(), stream);
+  ML::thrustAllocatorAdapter alloc(handle.get_device_allocator(), stream);
   exclusive_scan(thrust::cuda::par(alloc).on(stream), dev_vd,
                  dev_vd + batchSize, dev_ex_scan);
diff --git a/cpp/src/dbscan/adjgraph/naive.cuh b/cpp/src/dbscan/adjgraph/naive.cuh
index dbffae372d..ae44f5a59e 100644
--- a/cpp/src/dbscan/adjgraph/naive.cuh
+++ b/cpp/src/dbscan/adjgraph/naive.cuh
@@ -16,10 +16,10 @@
 
 #pragma once
 
-#include
+#include
 #include
 #include
-#include
+#include
 #include "../common.cuh"
 #include "pack.h"
@@ -28,23 +28,23 @@ namespace AdjGraph {
 namespace Naive {
 
 template <typename Index_>
-void launcher(const ML::cumlHandle_impl& handle, Pack<Index_> data,
-              Index_ batchSize, cudaStream_t stream) {
+void launcher(const raft::handle_t& handle, Pack<Index_> data, Index_ batchSize,
+              cudaStream_t stream) {
   Index_ k = 0;
   Index_ N = data.N;
-  MLCommon::host_buffer<Index_> host_vd(handle.getHostAllocator(), stream,
+  MLCommon::host_buffer<Index_> host_vd(handle.get_host_allocator(), stream,
                                         batchSize + 1);
-  MLCommon::host_buffer<bool> host_core_pts(handle.getHostAllocator(), stream,
+  MLCommon::host_buffer<bool> host_core_pts(handle.get_host_allocator(), stream,
                                             batchSize);
-  MLCommon::host_buffer<bool> host_adj(handle.getHostAllocator(), stream,
+  MLCommon::host_buffer<bool> host_adj(handle.get_host_allocator(), stream,
                                        batchSize * N);
-  MLCommon::host_buffer<Index_> host_ex_scan(handle.getHostAllocator(), stream,
-                                             batchSize);
-  MLCommon::updateHost(host_adj.data(), data.adj, batchSize * N, stream);
-  MLCommon::updateHost(host_vd.data(), data.vd, batchSize + 1, stream);
+  MLCommon::host_buffer<Index_> host_ex_scan(handle.get_host_allocator(),
+                                             stream, batchSize);
+  raft::update_host(host_adj.data(), data.adj, batchSize * N, stream);
+  raft::update_host(host_vd.data(), data.vd, batchSize + 1, stream);
   CUDA_CHECK(cudaStreamSynchronize(stream));
   size_t adjgraph_size = size_t(host_vd[batchSize]);
-  MLCommon::host_buffer<Index_> host_adj_graph(handle.getHostAllocator(),
+  MLCommon::host_buffer<Index_> host_adj_graph(handle.get_host_allocator(),
                                                stream, adjgraph_size);
   for (Index_ i = 0; i < batchSize; i++) {
     for (Index_ j = 0; j < N; j++) {
@@ -59,11 +59,10 @@ void launcher(const ML::cumlHandle_impl& handle, Pack<Index_> data,
   host_ex_scan[0] = Index_(0);
   for (Index_ i = 1; i < batchSize; i++)
     host_ex_scan[i] = host_ex_scan[i - 1] + host_vd[i - 1];
-  MLCommon::updateDevice(data.adj_graph, host_adj_graph.data(), adjgraph_size,
-                         stream);
-  MLCommon::updateDevice(data.core_pts, host_core_pts.data(), batchSize,
-                         stream);
-  MLCommon::updateDevice(data.ex_scan, host_ex_scan.data(), batchSize, stream);
+  raft::update_device(data.adj_graph, host_adj_graph.data(), adjgraph_size,
+                      stream);
+  raft::update_device(data.core_pts, host_core_pts.data(), batchSize, stream);
+  raft::update_device(data.ex_scan, host_ex_scan.data(), batchSize, stream);
 }
 }  // namespace Naive
 }  // namespace AdjGraph
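`MLCommon::updateHost`/`updateDevice` become `raft::update_host`/`raft::update_device` with the same argument order (destination, source, count, stream) and the same asynchronous semantics, which is why the `cudaStreamSynchronize` above survives the port. A hedged round-trip sketch; the `raft/cudart_utils.h` header path is an assumption:

```cpp
#include <vector>
#include <raft/handle.hpp>
#include <raft/cudart_utils.h>  // assumed home of update_host/update_device

void roundtrip(const raft::handle_t& handle, int* d_vals, int n) {
  cudaStream_t stream = handle.get_stream();
  std::vector<int> h_vals(n);

  raft::update_host(h_vals.data(), d_vals, n, stream);  // device -> host, async
  CUDA_CHECK(cudaStreamSynchronize(stream));            // required before reads

  for (int& v : h_vals) v += 1;  // mutate on the host

  raft::update_device(d_vals, h_vals.data(), n, stream);  // host -> device
}
```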
-         Index_* adj_graph, Index_ adjnnz, Index_* ex_scan, Index_ N,
-         Index_ minpts, bool* core_pts, int algo, Index_ batchSize,
-         cudaStream_t stream) {
+void run(const raft::handle_t& handle, bool* adj, Index_* vd, Index_* adj_graph,
+         Index_ adjnnz, Index_* ex_scan, Index_ N, Index_ minpts,
+         bool* core_pts, int algo, Index_ batchSize, cudaStream_t stream) {
   Pack<Index_> data = {vd, adj, adj_graph, adjnnz, ex_scan, core_pts, N, minpts};
   switch (algo) {
diff --git a/cpp/src/dbscan/dbscan.cu b/cpp/src/dbscan/dbscan.cu
index 077e72212a..7f8ae9e286 100644
--- a/cpp/src/dbscan/dbscan.cu
+++ b/cpp/src/dbscan/dbscan.cu
@@ -13,8 +13,8 @@
  * See the License for the specific language governing permissions and
  * limitations under the License.
  */
-#include
 #include
+#include
 #include
 #include
 #include "dbscan.cuh"
@@ -24,70 +24,72 @@ namespace ML {
 
 using namespace Dbscan;
 
-void dbscanFit(const cumlHandle &handle, float *input, int n_rows, int n_cols,
-               float eps, int min_pts, int *labels, size_t max_bytes_per_batch,
-               int verbosity) {
-  dbscanFitImpl<float, int>(handle.getImpl(), input, n_rows, n_cols, eps,
-                            min_pts, labels, nullptr, max_bytes_per_batch,
-                            handle.getStream(), verbosity);
+void dbscanFit(const raft::handle_t &handle, float *input, int n_rows,
+               int n_cols, float eps, int min_pts, int *labels,
+               size_t max_bytes_per_batch, int verbosity) {
+  dbscanFitImpl<float, int>(handle, input, n_rows, n_cols, eps, min_pts, labels,
+                            nullptr, max_bytes_per_batch, handle.get_stream(),
+                            verbosity);
 }
 
-void dbscanFit(const cumlHandle &handle, double *input, int n_rows, int n_cols,
-               double eps, int min_pts, int *labels, size_t max_bytes_per_batch,
-               int verbosity) {
-  dbscanFitImpl<double, int>(handle.getImpl(), input, n_rows, n_cols, eps,
-                             min_pts, labels, nullptr, max_bytes_per_batch,
-                             handle.getStream(), verbosity);
+void dbscanFit(const raft::handle_t &handle, double *input, int n_rows,
+               int n_cols, double eps, int min_pts, int *labels,
+               size_t max_bytes_per_batch, int verbosity) {
+  dbscanFitImpl<double, int>(handle, input, n_rows, n_cols, eps, min_pts,
+                             labels, nullptr, max_bytes_per_batch,
+                             handle.get_stream(), verbosity);
 }
 
-void dbscanFit(const cumlHandle &handle, float *input, int n_rows, int n_cols,
-               float eps, int min_pts, int *labels, int *core_sample_indices,
-               size_t max_bytes_per_batch, int verbosity) {
-  dbscanFitImpl<float, int>(handle.getImpl(), input, n_rows, n_cols, eps,
-                            min_pts, labels, core_sample_indices,
-                            max_bytes_per_batch, handle.getStream(), verbosity);
+void dbscanFit(const raft::handle_t &handle, float *input, int n_rows,
+               int n_cols, float eps, int min_pts, int *labels,
+               int *core_sample_indices, size_t max_bytes_per_batch,
+               int verbosity) {
+  dbscanFitImpl<float, int>(handle, input, n_rows, n_cols, eps, min_pts, labels,
+                            core_sample_indices, max_bytes_per_batch,
+                            handle.get_stream(), verbosity);
 }
 
-void dbscanFit(const cumlHandle &handle, double *input, int n_rows, int n_cols,
-               double eps, int min_pts, int *labels, int *core_sample_indices,
-               size_t max_bytes_per_batch, int verbosity) {
-  dbscanFitImpl<double, int>(
-    handle.getImpl(), input, n_rows, n_cols, eps, min_pts, labels,
-    core_sample_indices, max_bytes_per_batch, handle.getStream(), verbosity);
+void dbscanFit(const raft::handle_t &handle, double *input, int n_rows,
+               int n_cols, double eps, int min_pts, int *labels,
+               int *core_sample_indices, size_t max_bytes_per_batch,
+               int verbosity) {
+  dbscanFitImpl<double, int>(handle, input, n_rows, n_cols, eps, min_pts,
+                             labels, core_sample_indices, max_bytes_per_batch,
+                             handle.get_stream(), verbosity);
 }
 
-void dbscanFit(const cumlHandle &handle, float *input, int64_t n_rows,
+void dbscanFit(const raft::handle_t &handle, float *input, int64_t n_rows,
                int64_t n_cols, float eps, int min_pts, int64_t *labels,
                size_t max_bytes_per_batch, int verbosity) {
-  dbscanFitImpl<float, int64_t>(handle.getImpl(), input, n_rows, n_cols, eps,
-                                min_pts, labels, nullptr, max_bytes_per_batch,
-                                handle.getStream(), verbosity);
+  dbscanFitImpl<float, int64_t>(handle, input, n_rows, n_cols, eps, min_pts,
+                                labels, nullptr, max_bytes_per_batch,
+                                handle.get_stream(), verbosity);
 }
 
-void dbscanFit(const cumlHandle &handle, double *input, int64_t n_rows,
+void dbscanFit(const raft::handle_t &handle, double *input, int64_t n_rows,
                int64_t n_cols, double eps, int min_pts, int64_t *labels,
                size_t max_bytes_per_batch, int verbosity) {
-  dbscanFitImpl<double, int64_t>(handle.getImpl(), input, n_rows, n_cols, eps,
-                                 min_pts, labels, nullptr, max_bytes_per_batch,
-                                 handle.getStream(), verbosity);
+  dbscanFitImpl<double, int64_t>(handle, input, n_rows, n_cols, eps, min_pts,
+                                 labels, nullptr, max_bytes_per_batch,
+                                 handle.get_stream(), verbosity);
 }
 
-void dbscanFit(const cumlHandle &handle, float *input, int64_t n_rows,
+void dbscanFit(const raft::handle_t &handle, float *input, int64_t n_rows,
                int64_t n_cols, float eps, int min_pts, int64_t *labels,
                int64_t *core_sample_indices, size_t max_bytes_per_batch,
                int verbosity) {
   dbscanFitImpl<float, int64_t>(
-    handle.getImpl(), input, n_rows, n_cols, eps, min_pts, labels,
-    core_sample_indices, max_bytes_per_batch, handle.getStream(), verbosity);
+    handle, input, n_rows, n_cols, eps, min_pts, labels, core_sample_indices,
+    max_bytes_per_batch, handle.get_stream(), verbosity);
 }
 
-void dbscanFit(const cumlHandle &handle, double *input, int64_t n_rows,
+void dbscanFit(const raft::handle_t &handle, double *input, int64_t n_rows,
                int64_t n_cols, double eps, int min_pts, int64_t *labels,
                int64_t *core_sample_indices, size_t max_bytes_per_batch,
                int verbosity) {
   dbscanFitImpl<double, int64_t>(
-    handle.getImpl(), input, n_rows, n_cols, eps, min_pts, labels,
-    core_sample_indices, max_bytes_per_batch, handle.getStream(), verbosity);
+    handle, input, n_rows, n_cols, eps, min_pts, labels, core_sample_indices,
+    max_bytes_per_batch, handle.get_stream(), verbosity);
 }
 
 };  // end namespace ML
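The public `dbscanFit` entry points keep their argument lists; only the handle type changes, and the stream is now read off the handle internally. A minimal caller sketch for the first single-precision overload; the `cuml/cluster/dbscan.hpp` header path is an assumption:

```cpp
#include <cuda_runtime.h>
#include <raft/handle.hpp>
#include <cuml/cluster/dbscan.hpp>  // assumed public header

void run_dbscan(float* d_input, int n_rows, int n_cols) {
  raft::handle_t handle;  // owns the stream dbscanFit now reads internally

  int* d_labels = nullptr;
  cudaMalloc(&d_labels, sizeof(int) * n_rows);

  // max_bytes_per_batch = 0 falls back to the default cap chosen by
  // computeBatchCount (see dbscan.cuh below); verbosity 0 keeps logging off.
  ML::dbscanFit(handle, d_input, n_rows, n_cols, /*eps=*/0.5f, /*min_pts=*/5,
                d_labels, /*max_bytes_per_batch=*/0, /*verbosity=*/0);

  cudaFree(d_labels);
}
```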
diff --git a/cpp/src/dbscan/dbscan.cuh b/cpp/src/dbscan/dbscan.cuh
index 04a9d642f8..af6420f1d1 100644
--- a/cpp/src/dbscan/dbscan.cuh
+++ b/cpp/src/dbscan/dbscan.cuh
@@ -60,12 +60,12 @@ Index_ computeBatchCount(size_t &estimated_memory, Index_ n_rows,
     max_mbytes_per_batch = DEFAULT_MAX_MEM_MBYTES;
   }
 
-  Index_ nBatches =
-    (Index_)ceildiv<size_t>(estimated_memory, max_mbytes_per_batch * 1000000);
+  Index_ nBatches = (Index_)raft::ceildiv<size_t>(
+    estimated_memory, max_mbytes_per_batch * 1000000);
 
   Index_ MAX_LABEL = std::numeric_limits<Index_>::max();
   // to avoid overflow, we need: batch_size <= MAX_LABEL / n_rows (floor div)
-  // -> num_batches >= ceildiv(n_rows / (MAX_LABEL / n_rows))
-  Index_ nBatchesPrec = ceildiv(n_rows, MAX_LABEL / n_rows);
+  // -> num_batches >= raft::ceildiv(n_rows / (MAX_LABEL / n_rows))
+  Index_ nBatchesPrec = raft::ceildiv(n_rows, MAX_LABEL / n_rows);
   // at some point, if nBatchesPrec is larger than nBatches
   // (or larger by a given factor) and we know that there are clear
   // performance benefits of using a smaller number of batches,
@@ -75,7 +75,7 @@ Index_ computeBatchCount(size_t &estimated_memory, Index_ n_rows,
   // actually improve performance, even when using >16.10^9 points per batch.
   // Much larger batches than 16.10^9 do not currently fit on GPU architectures
   if (sizeof(Index_) > sizeof(int) &&
-      (size_t)n_rows * ceildiv(n_rows, nBatches) <
+      (size_t)n_rows * raft::ceildiv(n_rows, nBatches) <
         std::numeric_limits<int>::max()) {
     CUML_LOG_WARN(
       "You are using an index type of size (%d bytes) but a smaller index "
@@ -92,7 +92,7 @@
 }
 
 template <typename T, typename Index_ = int>
-void dbscanFitImpl(const ML::cumlHandle_impl &handle, T *input, Index_ n_rows,
+void dbscanFitImpl(const raft::handle_t &handle, T *input, Index_ n_rows,
                    Index_ n_cols, T eps, Index_ min_pts, Index_ *labels,
                    Index_ *core_sample_indices, size_t max_mbytes_per_batch,
                    cudaStream_t stream, int verbosity) {
@@ -117,7 +117,7 @@ void dbscanFitImpl(const ML::cumlHandle_impl &handle, T *input, Index_ n_rows,
     handle, input, n_rows, n_cols, eps, min_pts, labels, core_sample_indices,
     algoVd, algoAdj, algoCcl, NULL, n_batches, stream);
 
-  MLCommon::device_buffer<char> workspace(handle.getDeviceAllocator(), stream,
+  MLCommon::device_buffer<char> workspace(handle.get_device_allocator(), stream,
                                           workspaceSize);
   Dbscan::run(handle, input, n_rows, n_cols, eps, min_pts, labels,
               core_sample_indices, algoVd, algoAdj, algoCcl, workspace.data(),
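`raft::ceildiv` is a drop-in rename of the old `ceildiv` prim: integer ceiling division, equivalent to `(a + b - 1) / b` for positive integers. A worked instance of the batch arithmetic above (the `raft/cuda_utils.cuh` header path is an assumption):

```cpp
#include <raft/cuda_utils.cuh>  // assumed home of raft::ceildiv

size_t example_batch_count() {
  size_t estimated_memory = 5000000000;  // ~5 GB estimated for the run
  size_t max_mbytes_per_batch = 2000;    // 2000 MB cap per batch
  // Ceiling division rounds up: 5e9 / 2e9 yields 3 batches, not a
  // truncated 2 that would overflow the per-batch budget.
  return raft::ceildiv<size_t>(estimated_memory,
                               max_mbytes_per_batch * 1000000);
}
```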
diff --git a/cpp/src/dbscan/dbscan_api.cpp b/cpp/src/dbscan/dbscan_api.cpp
index de7233fcf5..15cb1bf684 100644
--- a/cpp/src/dbscan/dbscan_api.cpp
+++ b/cpp/src/dbscan/dbscan_api.cpp
@@ -24,12 +24,12 @@ cumlError_t cumlSpDbscanFit(cumlHandle_t handle, float *input, int n_rows,
                             int *core_sample_indices, size_t max_bytes_per_batch,
                             int verbosity) {
   cumlError_t status;
-  ML::cumlHandle *handle_ptr;
+  raft::handle_t *handle_ptr;
   std::tie(handle_ptr, status) = ML::handleMap.lookupHandlePointer(handle);
   if (status == CUML_SUCCESS) {
     try {
-      dbscanFit(*handle_ptr, input, n_rows, n_cols, eps, min_pts, labels,
-                core_sample_indices, max_bytes_per_batch, verbosity);
+      ML::dbscanFit(*handle_ptr, input, n_rows, n_cols, eps, min_pts, labels,
+                    core_sample_indices, max_bytes_per_batch, verbosity);
     }
     //TODO: Implement this
     //catch (const MLCommon::Exception& e)
@@ -49,12 +49,12 @@ cumlError_t cumlDpDbscanFit(cumlHandle_t handle, double *input, int n_rows,
                             int *core_sample_indices, size_t max_bytes_per_batch,
                             int verbosity) {
   cumlError_t status;
-  ML::cumlHandle *handle_ptr;
+  raft::handle_t *handle_ptr;
   std::tie(handle_ptr, status) = ML::handleMap.lookupHandlePointer(handle);
   if (status == CUML_SUCCESS) {
     try {
-      dbscanFit(*handle_ptr, input, n_rows, n_cols, eps, min_pts, labels,
-                core_sample_indices, max_bytes_per_batch, verbosity);
+      ML::dbscanFit(*handle_ptr, input, n_rows, n_cols, eps, min_pts, labels,
+                    core_sample_indices, max_bytes_per_batch, verbosity);
     }
     //TODO: Implement this
     //catch (const MLCommon::Exception& e)
diff --git a/cpp/src/dbscan/runner.cuh b/cpp/src/dbscan/runner.cuh
index 749bc0f04e..1f0a5c7f8d 100644
--- a/cpp/src/dbscan/runner.cuh
+++ b/cpp/src/dbscan/runner.cuh
@@ -16,12 +16,12 @@
 
 #pragma once
 
-#include
+#include
 #include
 #include
 #include
-#include
 #include
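The `dbscan_api.cpp` hunks show the C API is insulated from the port: the opaque `cumlHandle_t` now resolves to a `raft::handle_t` through `ML::handleMap`, but callers are untouched. A sketch of that unchanged call path; `cumlCreate`/`cumlDestroy` and the header names are assumptions drawn from the cuML C API, not from this diff:

```cpp
#include <cuml/cuml_api.h>           // assumed: cumlCreate/cumlDestroy
#include <cuml/cluster/dbscan_api.h> // assumed: cumlSpDbscanFit

int run_dbscan_c(float* d_input, int n_rows, int n_cols, int* d_labels) {
  cumlHandle_t handle;
  if (cumlCreate(&handle) != CUML_SUCCESS) return 1;

  // Same signature as before the port; the raft::handle_t lookup happens
  // behind cumlSpDbscanFit via ML::handleMap.lookupHandlePointer.
  cumlError_t status =
    cumlSpDbscanFit(handle, d_input, n_rows, n_cols, /*eps=*/0.5f,
                    /*min_pts=*/5, d_labels, /*core_sample_indices=*/NULL,
                    /*max_bytes_per_batch=*/0, /*verbosity=*/0);

  cumlDestroy(handle);
  return status == CUML_SUCCESS ? 0 : 1;
}
```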