
Multi-gpu terminate called after throwing an instance of 'thrust::system::system_error' parallel_for failed: out of memory #3756

Closed
pablete opened this issue Oct 5, 2018 · 4 comments

pablete commented Oct 5, 2018

Building xgboost for multi-GPU support from commit c6b5df6,
with CUDA 9.0
and NCCL 2.2.13-1+cuda9.0,
on an Amazon p2.8xlarge (488 GB RAM, 32 vCPUs, 8 Tesla K80 GPUs).

CODE

import xgboost as xgb

# X, y and the Data train/validation splitter come from my own data pipeline (not shown).
print(X.count())
print(X.info(memory_usage='deep'))

data = Data(X, y)
dtrain = xgb.DMatrix(data.X_train, data.y_train)
dval = xgb.DMatrix(data.X_val, data.y_val)
p = {'min_child_weight': 50,
     'subsample': 1.0,
     'max_depth': 11,
     'learning_rate': 0.03,
     'colsample_bytree': 0.4,
     'sketch_eps': 0.01,
     'n_gpus': 8,
     'alpha': 0,
     'lambda': 1,
     'max_bin': 64,
     'tree_method': 'gpu_hist',
     'objective': 'gpu:reg:logistic'}
bst = xgb.train(p, dtrain, 10, [(dtrain, "train"), (dval, "val")])

DATAFRAME has ~125M rows and 108 features (float32)

<class 'pandas.core.frame.DataFrame'>
Int64Index: 124646054 entries, 0 to 139271
Columns: 108 entries, xxxxx to yyyyyyy
dtypes: float32(108)
memory usage: 51.1 GB
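
As a sanity check, the reported footprint is consistent with the shape above: 108 float32 columns plus the Int64Index come to roughly 51 GiB. A minimal sketch of the arithmetic, using only the numbers shown above:

# Back-of-the-envelope recomputation of the pandas figure above.
rows, cols = 124646054, 108
data_bytes = rows * cols * 4          # float32 columns: 4 bytes per value
index_bytes = rows * 8                # Int64Index: 8 bytes per entry
print((data_bytes + index_bytes) / float(1024 ** 3))   # ~51.1, matching df.info(memory_usage='deep')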

Traceback (most recent call last):
  File "train_now.py", line 120, in <module>
    main()
  File "train_now.py", line 87, in main
    bst = xgb.train(params, dtrain, args.num_rounds, [(dtrain, "train"), (dval, "val")])
  File "/mnt/xgboost/venv/local/lib/python2.7/site-packages/xgboost/training.py", line 216, in train
    xgb_model=xgb_model, callbacks=callbacks)
  File "/mnt/xgboost/venv/local/lib/python2.7/site-packages/xgboost/training.py", line 74, in _train_internal
    bst.update(dtrain, i, obj)
  File "/mnt/xgboost/venv/local/lib/python2.7/site-packages/xgboost/core.py", line 1035, in update
    dtrain.handle))
  File "/mnt/xgboost/venv/local/lib/python2.7/site-packages/xgboost/core.py", line 165, in _check_call
    raise XGBoostError(_LIB.XGBGetLastError())
xgboost.core.XGBoostError: [19:26:13] /mnt/dmlc/xgboost/include/xgboost/../../src/common/common.h:41: /mnt/dmlc/xgboost/src/predictor/../common/device_helpers.cuh: 409: out of memory

Stack trace returned 10 entries:
[bt] (0) /mnt/xgboost/venv/xgboost/libxgboost.so(dmlc::StackTrace[abi:cxx11]()+0x5b) [0x7fe98b5dba6b]
[bt] (1) /mnt/xgboost/venv/xgboost/libxgboost.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x28) [0x7fe98b5dc2e8]
[bt] (2) /mnt/xgboost/venv/xgboost/libxgboost.so(dh::ThrowOnCudaError(cudaError, char const*, int)+0x22f) [0x7fe98b814e4f]
[bt] (3) /mnt/xgboost/venv/xgboost/libxgboost.so(xgboost::predictor::DeviceMatrix::DeviceMatrix(xgboost::DMatrix*, int, bool)+0x16f) [0x7fe98b84ed4f]
[bt] (4) /mnt/xgboost/venv/xgboost/libxgboost.so(xgboost::predictor::GPUPredictor::DevicePredictInternal(xgboost::DMatrix*, xgboost::HostDeviceVector, xgboost::gbm::GBTreeModel const&, unsigned long, unsigned long)+0xbc7) [0x7fe98b850d47]
[bt] (5) /mnt/xgboost/venv/xgboost/libxgboost.so(xgboost::predictor::GPUPredictor::UpdatePredictionCache(xgboost::gbm::GBTreeModel const&, std::vector<std::unique_ptr<xgboost::TreeUpdater, std::default_delete<xgboost::TreeUpdater> >, std::allocator<std::unique_ptr<xgboost::TreeUpdater, std::default_delete<xgboost::TreeUpdater> > > >, int)+0x75) [0x7fe98b851475]
[bt] (6) /mnt/xgboost/venv/xgboost/libxgboost.so(xgboost::gbm::GBTree::DoBoost(xgboost::DMatrix*, xgboost::HostDeviceVector<xgboost::detail::GradientPairInternal >, xgboost::ObjFunction)+0x6d5) [0x7fe98b661ef5]
[bt] (7) /mnt/xgboost/venv/xgboost/libxgboost.so(xgboost::LearnerImpl::UpdateOneIter(int, xgboost::DMatrix*)+0x361) [0x7fe98b66f351]
[bt] (8) /mnt/xgboost/venv/xgboost/libxgboost.so(XGBoosterUpdateOneIter+0x48) [0x7fe98b5cfb58]
[bt] (9) /usr/lib/x86_64-linux-gnu/libffi.so.6(ffi_call_unix64+0x4c) [0x7fe9fd287e40]

terminate called after throwing an instance of 'thrust::system::system_error'
what(): parallel_for failed: out of memory
Aborted

nvidia-smi snapshot before the crash

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 390.87                 Driver Version: 390.87                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           Off  | 00000000:00:17.0 Off |                    0 |
| N/A   57C    P0    69W / 149W |   2095MiB / 11441MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla K80           Off  | 00000000:00:18.0 Off |                    0 |
| N/A   53C    P0    82W / 149W |   2095MiB / 11441MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla K80           Off  | 00000000:00:19.0 Off |                    0 |
| N/A   57C    P0    65W / 149W |   2095MiB / 11441MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla K80           Off  | 00000000:00:1A.0 Off |                    0 |
| N/A   53C    P0    81W / 149W |   2095MiB / 11441MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   4  Tesla K80           Off  | 00000000:00:1B.0 Off |                    0 |
| N/A   59C    P0    70W / 149W |   2095MiB / 11441MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   5  Tesla K80           Off  | 00000000:00:1C.0 Off |                    0 |
| N/A   53C    P0    82W / 149W |   2095MiB / 11441MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   6  Tesla K80           Off  | 00000000:00:1D.0 Off |                    0 |
| N/A   59C    P0    73W / 149W |   2095MiB / 11441MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   7  Tesla K80           Off  | 00000000:00:1E.0 Off |                    0 |
| N/A   54C    P0    87W / 149W |   2095MiB / 11441MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0     35738      C   -                                           2082MiB |
|    1     35738      C   -                                           2082MiB |
|    2     35738      C   -                                           2082MiB |
|    3     35738      C   -                                           2082MiB |
|    4     35738      C   -                                           2082MiB |
|    5     35738      C   -                                           2082MiB |
|    6     35738      C   -                                           2082MiB |
|    7     35738      C   -                                           2082MiB |
+-----------------------------------------------------------------------------+

I cannot publish the dataset, but creating a similar one should not be a problem.

Thanks

trivialfis (Member) commented

Currently the predictor doesn't support multi-GPU yet, but it should soon. :) See #3738.

So there is only one device being used for prediction, which means the memory capacity is limited to that of a single device.
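
In other words, the dataset (~51 GiB in host memory) is far larger than the ~11 GiB a single K80 exposes, so building the prediction DeviceMatrix on one device runs out of memory. A rough sketch of that comparison, using the figures reported above (the on-device layout is not identical to the pandas footprint, so this is only an upper-bound check):

# Rough single-device feasibility check, using numbers reported in this issue.
dataset_gib = 51.1                 # host-side DataFrame size from df.info
per_gpu_mib = 11441                # total memory of one K80, from nvidia-smi
print(dataset_gib > per_gpu_mib / 1024.0)   # True: one GPU cannot hold the data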

pablete (Author) commented Oct 6, 2018

Thanks, I combined it with predictor: 'cpu_predictor' and it works. I will wait until the multi-GPU predictor is available and will test the PR early on.

Thanks for the pointer!
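
For anyone hitting the same error before #3738 lands, a minimal sketch of the workaround described above, reusing the p, dtrain and dval objects from the issue body (a sketch, not an official recommendation):

# Keep gpu_hist for training but force prediction onto the CPU until
# multi-GPU prediction (#3738) is available.
p['predictor'] = 'cpu_predictor'
bst = xgb.train(p, dtrain, 10, [(dtrain, "train"), (dval, "val")])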

ch11y commented Nov 19, 2018

I am also running XGBoost on GPU. My data is also large: 900 features and 30M samples.
terminate called after throwing an instance of 'thrust::system::system_error'
what(): parallel_for failed: out of memory

But when I check GPU utilization, it only uses 14 GB out of the GPU's 16 GB. I have also tried with and without predictor: 'cpu_predictor'; both end up with the same error. Any idea why, @hcho3?

hcho3 (Collaborator) commented Nov 19, 2018

@ch11y Make sure to compile from the latest master. This issue has been resolved by #3738. If you keep seeing the out-of-memory error, open a new issue.

dmlc locked as resolved and limited conversation to collaborators Nov 19, 2018