Some guidelines on device memory usage #5038

Merged
merged 2 commits into from Nov 16, 2019
6 changes: 2 additions & 4 deletions demo/gpu_acceleration/README.md
@@ -1,7 +1,5 @@
# GPU Acceleration Demo

This demo shows how to train a model on the [forest cover type](https://archive.ics.uci.edu/ml/datasets/covertype) dataset using GPU acceleration. The forest cover type dataset has 581,012 rows and 54 features, making it time consuming to process. We compare the run-time and accuracy of the GPU and CPU histogram algorithms.
`cover_type.py` shows how to train a model on the [forest cover type](https://archive.ics.uci.edu/ml/datasets/covertype) dataset using GPU acceleration. The forest cover type dataset has 581,012 rows and 54 features, making it time consuming to process. We compare the run-time and accuracy of the GPU and CPU histogram algorithms.

This demo requires the [GPU plug-in](https://xgboost.readthedocs.io/en/latest/gpu/index.html) to be built and installed.

The dataset is automatically loaded via the sklearn script.
`memory.py` shows how to repeatedly train xgboost models while freeing memory between iterations.
51 changes: 51 additions & 0 deletions demo/gpu_acceleration/memory.py
@@ -0,0 +1,51 @@
import xgboost as xgb
import numpy as np
import pickle
import GPUtil

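# Synthetic dataset: n rows by m features, with random labels in [0, 1)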
n = 10000
m = 1000
X = np.random.random((n, m))
y = np.random.random(n)

param = {'objective': 'binary:logistic',
         'tree_method': 'gpu_hist'}
iterations = 5
dtrain = xgb.DMatrix(X, label=y)

# Example 1: High memory usage
# Active Booster objects hold device memory, which persists across iterations.
boosters = []
for i in range(iterations):
bst = xgb.train(param, dtrain)
boosters.append(bst)

print("Example 1")
GPUtil.showUtilization()
del boosters

# Example 2: Better memory usage
# Serialising each booster lets the old bst object be destroyed by the Python gc,
# freeing its device memory when the reference is rebound on the next iteration.
# The gc may not free it immediately, so more than one booster can be allocated at a time.
boosters = []
for i in range(iterations):
bst = xgb.train(param, dtrain)
boosters.append(pickle.dumps(bst))

print("Example 2")
GPUtil.showUtilization()
del boosters

# Example 3: Best memory usage
# The booster is deleted explicitly after serialisation, so its device memory is
# freed before the next iteration starts.
boosters = []
for i in range(iterations):
bst = xgb.train(param, dtrain)
boosters.append(pickle.dumps(bst))
del bst

print("Example 3")
GPUtil.showUtilization()
del boosters
17 changes: 17 additions & 0 deletions doc/gpu/index.rst
@@ -204,6 +204,23 @@ Training time on 1,000,000 rows x 50 columns with 500 boosting iterations a

See `GPU Accelerated XGBoost <https://xgboost.ai/2016/12/14/GPU-accelerated-xgboost.html>`_ and `Updates to the XGBoost GPU algorithms <https://xgboost.ai/2018/07/04/gpu-xgboost-update.html>`_ for additional performance benchmarks of the ``gpu_hist`` tree method.

Memory usage
============
The following are some guidelines on the device memory usage of the ``gpu_hist`` updater.

If you train XGBoost in a loop you may notice that it does not free device memory after each training iteration. This is because memory is allocated over the lifetime of the booster object and is not freed until the booster itself is freed. A workaround is to serialise the booster object after training, as sketched below. See ``demo/gpu_acceleration/memory.py`` for a simple example.
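
A minimal sketch of this workaround, assuming a ``param`` dict and ``dtrain`` DMatrix as in the demo:

.. code-block:: python

    import pickle
    import xgboost as xgb

    bst = xgb.train(param, dtrain)

    # Serialise the booster to host memory, then drop the live object so its
    # device memory can be reclaimed before training again.
    bst_bytes = pickle.dumps(bst)
    del bst

    # The serialised copy can be restored later, e.g. for prediction.
    bst = pickle.loads(bst_bytes)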

Memory inside XGBoost training is generally allocated for two reasons: storing the dataset, and working memory.

The dataset itself is stored on device in a compressed ELLPACK format. The ELLPACK format is a type of sparse matrix that stores elements with a constant row stride. This format is convenient for parallel computation when compared to CSR because the row index of each element is known directly from its address in memory. The disadvantage of the ELLPACK format is that it becomes less memory efficient if the maximum row length is significantly larger than the average row length. Elements are quantised and stored as integers, and these integers are compressed to a minimal bit length: depending on the number of features, the full range of a 32-bit integer is usually not needed, so we compress it down. The compressed, quantised ELLPACK format will commonly use about a quarter of the space of a CSR matrix stored in floating point.
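
As a rough, self-contained illustration of why quantisation saves memory (this is not XGBoost's internal code; the bin count of 256 and the dtypes are assumptions for the example):

.. code-block:: python

    import numpy as np

    X = np.random.random((10000, 50)).astype(np.float32)
    n_bins = 256  # assumed bin count; 256 bin indices fit in a single byte

    # Quantise each feature into integer bin indices at per-feature quantile cuts.
    cuts = np.quantile(X, np.linspace(0, 1, n_bins + 1)[1:-1], axis=0)
    X_binned = np.empty(X.shape, dtype=np.uint8)
    for j in range(X.shape[1]):
        X_binned[:, j] = np.searchsorted(cuts[:, j], X[:, j])

    # One byte per element instead of four bytes per float32 value plus CSR indices.
    print(X.nbytes, X_binned.nbytes)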

In some cases the full CSR matrix stored in floating point needs to be allocated on the device. This currently occurs for prediction in multiclass classification. If this is a problem, consider setting ``predictor`` to ``cpu_predictor``. This also occurs when the external data itself comes from a device source, e.g. a cuDF DataFrame. These are known issues we hope to resolve.
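
For example, CPU-based prediction can be selected while still training with ``gpu_hist`` by changing only the parameter dictionary:

.. code-block:: python

    param = {
        'objective': 'binary:logistic',
        'tree_method': 'gpu_hist',
        # Run prediction on the CPU to avoid allocating the full CSR matrix on device.
        'predictor': 'cpu_predictor',
    }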

Working memory is allocated inside the algorithm in proportion to the number of rows, to keep track of gradients, tree positions and other per-row statistics. Memory is allocated for histogram bins in proportion to the number of bins, the number of features and the number of nodes in the tree. For performance reasons we keep histograms in memory from previous nodes in the tree; when a certain threshold of memory usage is passed, we stop doing this to conserve memory, at some loss of performance.
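
As a back-of-envelope illustration only (the per-row and per-bin byte counts below are assumptions, not the exact sizes used internally), histogram memory grows with the number of bins, features and tree nodes:

.. code-block:: python

    n_rows = 1000000
    n_features = 50
    n_bins = 256
    max_depth = 8
    bytes_per_row = 16   # assumed per-row cost: gradient pair plus bookkeeping
    bytes_per_bin = 16   # assumed per-bin cost: one pair of gradient sums

    per_row_memory = n_rows * bytes_per_row
    # Roughly 2**max_depth nodes can have a cached histogram at any one time.
    histogram_memory = (2 ** max_depth) * n_features * n_bins * bytes_per_bin

    print("per-row working memory: ~%.0f MB" % (per_row_memory / 1e6))
    print("cached histograms:      ~%.0f MB" % (histogram_memory / 1e6))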

The quantile-finding algorithm also uses some working device memory. It can operate in batches, but it is not currently well optimised for sparse data.


Developer notes
===============
The application may be profiled with annotations by specifying USE_NVTX to CMake and providing the path to the stand-alone NVTX header via NVTX_HEADER_DIR. Regions covered by the 'Monitor' class in CUDA code will automatically appear in the Nsight profiler.