Thanks, I have combined it with predictor: 'cpu_predictor' and it works. I will wait until the multi-GPU predictor is available, and I will test the PR as soon as it is up.
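For anyone who hits the same error, here is a minimal sketch of that combination (the objective and the toy data are placeholders, not my real setup; only tree_method and predictor are the relevant parts):

```python
import numpy as np
import xgboost as xgb

# Toy stand-in data; in the real run the DMatrix comes from a large DataFrame.
X = np.random.rand(1000, 10).astype("float32")
y = (np.random.rand(1000) > 0.5).astype(int)
dtrain = xgb.DMatrix(X, label=y)

params = {
    "objective": "binary:logistic",   # placeholder objective
    "tree_method": "gpu_hist",        # build histograms on the GPU(s)
    "predictor": "cpu_predictor",     # run prediction/eval on the CPU, avoiding the GPU predictor OOM
}

bst = xgb.train(params, dtrain, num_boost_round=10, evals=[(dtrain, "train")])
```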
I am also running XGBoost on the GPU. My data is also large: 900 features and 30M samples.
terminate called after throwing an instance of 'thrust::system::system_error'
what(): parallel_for failed: out of memory
But when I check with the GPU utility, it only uses 14 GB out of the 16 GB on the GPU. I have also tried with and without predictor: 'cpu_predictor'; both end up with the same error. Any idea why, @hcho3?
Building xgboost for multi-gpu support from c6b5df6
With CUDA 9.0
With NCCL 2.2.13-1+cuda9.0
on an Amazon p2.8xlarge (488 GB RAM, 32 vCPUs, 8 K80 GPUs)
CODE
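(The script itself did not come through in this report, so the following is only a hedged reconstruction of the call that matches the traceback further down; the input paths, column names, and parameter values are all assumptions.)

```python
import argparse
import pandas as pd
import xgboost as xgb

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--num-rounds", dest="num_rounds", type=int, default=100)
    parser.add_argument("--train", default="train.parquet")  # hypothetical input paths
    parser.add_argument("--val", default="val.parquet")
    args = parser.parse_args()

    # Hypothetical loading step; the real script builds the ~124M x 108 float32
    # DataFrame shown in the df.info() output below.
    df = pd.read_parquet(args.train)
    df_val = pd.read_parquet(args.val)

    # "label" is an assumed target column name.
    dtrain = xgb.DMatrix(df.drop(columns=["label"]), label=df["label"])
    dval = xgb.DMatrix(df_val.drop(columns=["label"]), label=df_val["label"])

    # Assumed multi-GPU settings for the 8x K80 box (0.7x-era parameter names).
    params = {
        "objective": "binary:logistic",
        "tree_method": "gpu_hist",
        "n_gpus": 8,
        "max_depth": 6,
        "eta": 0.1,
    }

    # This is the xgb.train call that appears in the traceback below.
    bst = xgb.train(params, dtrain, args.num_rounds,
                    [(dtrain, "train"), (dval, "val")])
    bst.save_model("model.bin")

if __name__ == "__main__":
    main()
```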
DATAFRAME has ~120M entries, 108 features (float32)
<class 'pandas.core.frame.DataFrame'>
Int64Index: 124646054 entries, 0 to 139271
Columns: 108 entries, xxxxx to yyyyyyy
dtypes: float32(108)
memory usage: 51.1 GB
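(As a quick sanity check on those numbers, the float32 payload plus the Int64Index accounts for the reported figure:)

```python
# 124,646,054 rows x 108 float32 columns, plus an 8-byte Int64Index per row.
rows, cols = 124646054, 108
data_gib = rows * cols * 4 / 2.0**30    # ~50.1 GiB of float32 values
index_gib = rows * 8 / 2.0**30          # ~0.9 GiB for the Int64Index
print(round(data_gib + index_gib, 1))   # ~51.1, matching "memory usage: 51.1 GB"
```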
Traceback (most recent call last):
File "train_now.py", line 120, in
main()
File "train_now.py", line 87, in main
bst = xgb.train(params, dtrain, args.num_rounds, [(dtrain, "train"), (dval, "val")])
File "/mnt/xgboost/venv/local/lib/python2.7/site-packages/xgboost/training.py", line 216, in train
xgb_model=xgb_model, callbacks=callbacks)
File "/mnt/xgboost/venv/local/lib/python2.7/site-packages/xgboost/training.py", line 74, in _train_internal
bst.update(dtrain, i, obj)
File "/mnt/xgboost/venv/local/lib/python2.7/site-packages/xgboost/core.py", line 1035, in update
dtrain.handle))
File "/mnt/xgboost/venv/local/lib/python2.7/site-packages/xgboost/core.py", line 165, in _check_call
raise XGBoostError(_LIB.XGBGetLastError())
xgboost.core.XGBoostError: [19:26:13] /mnt/dmlc/xgboost/include/xgboost/../../src/common/common.h:41: /mnt/dmlc/xgboost/src/predictor/../common/device_helpers.cuh: 409: out of memory
Stack trace returned 10 entries:
[bt] (0) /mnt/xgboost/venv/xgboost/libxgboost.so(dmlc::StackTrace[abi:cxx11]()+0x5b) [0x7fe98b5dba6b]
[bt] (1) /mnt/xgboost/venv/xgboost/libxgboost.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x28) [0x7fe98b5dc2e8]
[bt] (2) /mnt/xgboost/venv/xgboost/libxgboost.so(dh::ThrowOnCudaError(cudaError, char const*, int)+0x22f) [0x7fe98b814e4f]
[bt] (3) /mnt/xgboost/venv/xgboost/libxgboost.so(xgboost::predictor::DeviceMatrix::DeviceMatrix(xgboost::DMatrix*, int, bool)+0x16f) [0x7fe98b84ed4f]
[bt] (4) /mnt/xgboost/venv/xgboost/libxgboost.so(xgboost::predictor::GPUPredictor::DevicePredictInternal(xgboost::DMatrix*, xgboost::HostDeviceVector<float>*, xgboost::gbm::GBTreeModel const&, unsigned long, unsigned long)+0xbc7) [0x7fe98b850d47]
[bt] (5) /mnt/xgboost/venv/xgboost/libxgboost.so(xgboost::predictor::GPUPredictor::UpdatePredictionCache(xgboost::gbm::GBTreeModel const&, std::vector<std::unique_ptr<xgboost::TreeUpdater, std::default_delete<xgboost::TreeUpdater> >, std::allocator<std::unique_ptr<xgboost::TreeUpdater, std::default_delete<xgboost::TreeUpdater> > > >*, int)+0x75) [0x7fe98b851475]
[bt] (6) /mnt/xgboost/venv/xgboost/libxgboost.so(xgboost::gbm::GBTree::DoBoost(xgboost::DMatrix*, xgboost::HostDeviceVector<xgboost::detail::GradientPairInternal<float> >*, xgboost::ObjFunction*)+0x6d5) [0x7fe98b661ef5]
[bt] (7) /mnt/xgboost/venv/xgboost/libxgboost.so(xgboost::LearnerImpl::UpdateOneIter(int, xgboost::DMatrix*)+0x361) [0x7fe98b66f351]
[bt] (8) /mnt/xgboost/venv/xgboost/libxgboost.so(XGBoosterUpdateOneIter+0x48) [0x7fe98b5cfb58]
[bt] (9) /usr/lib/x86_64-linux-gnu/libffi.so.6(ffi_call_unix64+0x4c) [0x7fe9fd287e40]
terminate called after throwing an instance of 'thrust::system::system_error'
what(): parallel_for failed: out of memory
Aborted
nvidia-smi snapshot before the crash
I cannot publish the dataset, but creating a similar one should be no problem.
Thanks