rapidsai · cjnolet · Nov 30, 2020 · Nov 18, 2020
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -36,6 +36,7 @@
 - PR #3135: Add QuasiNewton tests
 - PR #3040: Improved Array Conversion with CumlArrayDescriptor and Decorators
 - PR #3134: Improving the Deprecation Message Formatting in Documentation
+- PR #3154: Adding estimator pickling demo notebooks (and docs)
 - PR #3151: MNMG Logistic Regression via dask-glm
 - PR #3113: Add tags and prefered memory order tags to estimators
 - PR #3137: Reorganize Pytest Config and Add Quick Run Option

diff --git a/docs/source/index.rst b/docs/source/index.rst
@@ -26,6 +26,7 @@ Support for Windows is possible in the near future.
    cuml_intro.rst
    cuml_blogs.rst
    estimator_intro.ipynb
+   pickling_cuml_models.ipynb
 
 
 Indices and tables

diff --git a/docs/source/pickling_cuml_models.ipynb b/docs/source/pickling_cuml_models.ipynb
@@ -0,0 +1,209 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Pickling cuML Models for Persistence\n",
+    "\n",
+    "This notebook demonstrates how to pickle both single-GPU and multi-GPU cuML models for persistence."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import warnings\n",
+    "warnings.filterwarnings(\"ignore\", category=FutureWarning)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Single GPU Model Pickling\n",
+    "\n",
+    "All single-GPU estimators are pickleable. The following example creates a synthetic dataset, trains a model on it, and pickles the trained model for storage. Trained single-GPU models can also be used to distribute inference across a Dask cluster, as the `Distributed Model Pickling` section below describes."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from cuml.datasets import make_blobs\n",
+    "\n",
+    "X, y = make_blobs(n_samples=50,\n",
+    "                  n_features=10,\n",
+    "                  centers=5,\n",
+    "                  cluster_std=0.4,\n",
+    "                  random_state=0)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from cuml.cluster import KMeans\n",
+    "\n",
+    "model = KMeans(n_clusters=5)\n",
+    "\n",
+    "model.fit(X)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import pickle\n",
+    "\n",
+    "pickle.dump(model, open(\"kmeans_model.pkl\", \"wb\"))"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "model = pickle.load(open(\"kmeans_model.pkl\", \"rb\"))"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "model.cluster_centers_"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Distributed Model Pickling"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "The distributed estimator wrappers in `cuml.dask` are not intended to be pickled directly. Instead, the Dask cuML estimators provide a method, `get_combined_model()`, which returns the trained single-GPU model for pickling. The combined model can be used for inference on a single GPU, and the `ParallelPostFit` wrapper from the [Dask-ML](https://ml.dask.org/meta-estimators.html) library can be used to perform distributed inference on a Dask cluster."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from dask.distributed import Client\n",
+    "from dask_cuda import LocalCUDACluster\n",
+    "\n",
+    "cluster = LocalCUDACluster()\n",
+    "client = Client(cluster)\n",
+    "client"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from cuml.dask.datasets import make_blobs\n",
+    "\n",
+    "n_workers = len(client.scheduler_info()[\"workers\"].keys())\n",
+    "\n",
+    "X, y = make_blobs(n_samples=5000,\n",
+    "                  n_features=30,\n",
+    "                  centers=5,\n",
+    "                  cluster_std=0.4,\n",
+    "                  random_state=0,\n",
+    "                  n_parts=n_workers*5)\n",
+    "\n",
+    "X = X.persist()\n",
+    "y = y.persist()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from cuml.dask.cluster import KMeans\n",
+    "\n",
+    "dist_model = KMeans(n_clusters=5)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "dist_model.fit(X)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import pickle\n",
+    "\n",
+    "single_gpu_model = dist_model.get_combined_model()\n",
+    "pickle.dump(single_gpu_model, open(\"kmeans_model.pkl\", \"wb\"))"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "single_gpu_model = pickle.load(open(\"kmeans_model.pkl\", \"rb\"))"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "single_gpu_model.cluster_centers_"
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.8.6"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 4
+}
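
The save and load cells in the notebook pass `open(...)` straight to `pickle.dump` and `pickle.load`, which leaves closing the file to the interpreter. Below is a minimal sketch of the same round trip using context managers, so the file is closed even if pickling raises. To keep it runnable without a GPU, it assumes a `types.SimpleNamespace` as a hypothetical stand-in for the fitted model; the helper names `save_model` and `load_model` are also illustrative and are not part of cuML. With cuML installed, the fitted `KMeans` object (or the result of `get_combined_model()`) would presumably be passed in its place.

```python
import os
import pickle
import tempfile
from types import SimpleNamespace


def save_model(model, path):
    """Pickle `model` to `path`; the context manager closes the file on exit."""
    with open(path, "wb") as f:
        pickle.dump(model, f, protocol=pickle.HIGHEST_PROTOCOL)


def load_model(path):
    """Unpickle a model from `path`. Only load files from a trusted source."""
    with open(path, "rb") as f:
        return pickle.load(f)


# Hypothetical stand-in for a fitted estimator: it only mimics the
# `cluster_centers_` attribute that the notebook inspects after loading.
model = SimpleNamespace(cluster_centers_=[[0.0, 1.0], [2.0, 3.0]])

# Write to a temporary directory so the example leaves no files behind.
with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "kmeans_model.pkl")
    save_model(model, path)
    restored = load_model(path)

# The restored object is a separate instance carrying the same fitted state.
print(restored is not model)
print(restored.cluster_centers_ == model.cluster_centers_)
```

Because unpickling can execute arbitrary code, the same caution applies to any `.pkl` file produced this way: load it only if you trust where it came from.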