Some guidelines on device memory usage #5038

Merged
merged 2 commits into from Nov 16, 2019
6 changes: 2 additions & 4 deletions demo/gpu_acceleration/README.md
@@ -1,7 +1,5 @@
# GPU Acceleration Demo

This demo shows how to train a model on the [forest cover type](https://archive.ics.uci.edu/ml/datasets/covertype) dataset using GPU acceleration. The forest cover type dataset has 581,012 rows and 54 features, making it time consuming to process. We compare the run-time and accuracy of the GPU and CPU histogram algorithms.
`cover_type.py` shows how to train a model on the [forest cover type](https://archive.ics.uci.edu/ml/datasets/covertype) dataset using GPU acceleration. The forest cover type dataset has 581,012 rows and 54 features, making it time consuming to process. We compare the run-time and accuracy of the GPU and CPU histogram algorithms.

This demo requires the [GPU plug-in](https://xgboost.readthedocs.io/en/latest/gpu/index.html) to be built and installed.

The dataset is automatically loaded via the sklearn script.
`memory.py` shows how to repeatedly train xgboost models while freeing memory between iterations.
51 changes: 51 additions & 0 deletions demo/gpu_acceleration/memory.py
@@ -0,0 +1,51 @@
import xgboost as xgb
import numpy as np
import pickle
import GPUtil

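# Synthetic dataset: n rows by m features, with random labels in [0, 1)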
n = 10000
m = 1000
X = np.random.random((n, m))
y = np.random.random(n)

param = {'objective': 'binary:logistic',
         'tree_method': 'gpu_hist'}
iterations = 5
dtrain = xgb.DMatrix(X, label=y)

# Example 1: High memory usage
# Active Booster objects hold device memory, which persists across iterations.
boosters = []
for i in range(iterations):
bst = xgb.train(param, dtrain)
boosters.append(bst)

print("Example 1")
GPUtil.showUtilization()
del boosters

# Example 2: Better memory usage
# Serialising each booster lets the old bst object be destroyed by the Python gc,
# freeing its device memory when the reference is rebound on the next iteration.
# The gc may not free it immediately, so more than one booster can be allocated at a time.
boosters = []
for i in range(iterations):
bst = xgb.train(param, dtrain)
boosters.append(pickle.dumps(bst))

print("Example 2")
GPUtil.showUtilization()
del boosters

# Example 3: Best memory usage
# The booster is deleted explicitly after serialisation, so its device memory is
# freed before the next iteration starts.
boosters = []
for i in range(iterations):
bst = xgb.train(param, dtrain)
boosters.append(pickle.dumps(bst))
del bst

print("Example 3")
GPUtil.showUtilization()
del boosters
17 changes: 17 additions & 0 deletions doc/gpu/index.rst
@@ -204,6 +204,23 @@ Training time on 1,000,000 rows x 50 columns with 500 boosting iterations a

See `GPU Accelerated XGBoost <https://xgboost.ai/2016/12/14/GPU-accelerated-xgboost.html>`_ and `Updates to the XGBoost GPU algorithms <https://xgboost.ai/2018/07/04/gpu-xgboost-update.html>`_ for additional performance benchmarks of the ``gpu_hist`` tree method.

Memory usage
============
The following are some guidelines on the device memory usage of the ``gpu_hist`` updater.

If you train XGBoost in a loop you may notice that it does not free device memory after each training iteration. This is because memory is allocated over the lifetime of the booster object and is not freed until the booster itself is freed. A workaround is to serialise the booster object after training, as sketched below. See ``demo/gpu_acceleration/memory.py`` for a simple example.
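
A minimal sketch of this workaround, assuming a ``param`` dict and ``dtrain`` DMatrix as in the demo:

.. code-block:: python

    import pickle
    import xgboost as xgb

    bst = xgb.train(param, dtrain)

    # Serialise the booster to host memory, then drop the live object so its
    # device memory can be reclaimed before training again.
    bst_bytes = pickle.dumps(bst)
    del bst

    # The serialised copy can be restored later, e.g. for prediction.
    bst = pickle.loads(bst_bytes)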

Memory inside XGBoost training is generally allocated for two reasons: storing the dataset, and working memory.

The dataset itself is stored on device in a compressed ELLPACK format. The ELLPACK format is a type of sparse matrix that stores elements with a constant row stride. This format is convenient for parallel computation when compared to CSR because the row index of each element is known directly from its address in memory. The disadvantage of the ELLPACK format is that it becomes less memory efficient if the maximum row length is significantly larger than the average row length. Elements are quantised and stored as integers, and these integers are compressed to a minimal bit length: depending on the number of features, the full range of a 32-bit integer is usually not needed, so we compress it down. The compressed, quantised ELLPACK format will commonly use about a quarter of the space of a CSR matrix stored in floating point.
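
As a rough, self-contained illustration of why quantisation saves memory (this is not XGBoost's internal code; the bin count of 256 and the dtypes are assumptions for the example):

.. code-block:: python

    import numpy as np

    X = np.random.random((10000, 50)).astype(np.float32)
    n_bins = 256  # assumed bin count; 256 bin indices fit in a single byte

    # Quantise each feature into integer bin indices at per-feature quantile cuts.
    cuts = np.quantile(X, np.linspace(0, 1, n_bins + 1)[1:-1], axis=0)
    X_binned = np.empty(X.shape, dtype=np.uint8)
    for j in range(X.shape[1]):
        X_binned[:, j] = np.searchsorted(cuts[:, j], X[:, j])

    # One byte per element instead of four bytes per float32 value plus CSR indices.
    print(X.nbytes, X_binned.nbytes)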

In some cases the full CSR matrix stored in floating point needs to be allocated on the device. This currently occurs for prediction in multiclass classification. If this is a problem, consider setting ``predictor`` to ``cpu_predictor``. This also occurs when the external data itself comes from a device source, e.g. a cuDF DataFrame. These are known issues we hope to resolve.
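
For example, CPU-based prediction can be selected while still training with ``gpu_hist`` by changing only the parameter dictionary:

.. code-block:: python

    param = {
        'objective': 'binary:logistic',
        'tree_method': 'gpu_hist',
        # Run prediction on the CPU to avoid allocating the full CSR matrix on device.
        'predictor': 'cpu_predictor',
    }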

Working memory is allocated inside the algorithm in proportion to the number of rows, to keep track of gradients, tree positions and other per-row statistics. Memory is allocated for histogram bins in proportion to the number of bins, the number of features and the number of nodes in the tree. For performance reasons we keep histograms in memory from previous nodes in the tree; when a certain threshold of memory usage is passed, we stop doing this to conserve memory, at some loss of performance.
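
As a back-of-envelope illustration only (the per-row and per-bin byte counts below are assumptions, not the exact sizes used internally), histogram memory grows with the number of bins, features and tree nodes:

.. code-block:: python

    n_rows = 1000000
    n_features = 50
    n_bins = 256
    max_depth = 8
    bytes_per_row = 16   # assumed per-row cost: gradient pair plus bookkeeping
    bytes_per_bin = 16   # assumed per-bin cost: one pair of gradient sums

    per_row_memory = n_rows * bytes_per_row
    # Roughly 2**max_depth nodes can have a cached histogram at any one time.
    histogram_memory = (2 ** max_depth) * n_features * n_bins * bytes_per_bin

    print("per-row working memory: ~%.0f MB" % (per_row_memory / 1e6))
    print("cached histograms:      ~%.0f MB" % (histogram_memory / 1e6))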

The quantile-finding algorithm also uses some working device memory. It can operate in batches, but it is not currently well optimised for sparse data.


Developer notes
===============
The application may be profiled with annotations by specifying USE_NVTX to CMake and providing the path to the stand-alone NVTX header via NVTX_HEADER_DIR. Regions covered by the 'Monitor' class in CUDA code will automatically appear in the Nsight profiler.