
Training many models with gpu_hist in Optuna yields ‘parallel_for failed: out of memory’ #6225

Open · nils-fl opened this issue Oct 12, 2020 · 15 comments

nils-fl commented Oct 12, 2020

Hi, I am having an issue with XGBClassifier running out of memory on the GPU, and I tried to implement a workaround by saving the model to disk, deleting it, and loading it back in:

pickle.dump(self.model, open(f'tmp/model_{uid}.pkl', 'wb'))
del self.model
self.model = pickle.load(open(f'tmp/model_{uid}.pkl', 'rb'))
os.remove(f'tmp/model_{uid}.pkl')
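
A minimal sketch of the same round trip with an explicit gc.collect() added, on the assumption that the booster's GPU buffers can only be released once the old Python object is actually collected (the helper name is illustrative):

import gc
import os
import pickle

def reload_model(model, uid):
    # hypothetical helper: serialize, drop the in-memory copy, force collection, reload
    path = f'tmp/model_{uid}.pkl'
    with open(path, 'wb') as f:
        pickle.dump(model, f)
    del model
    gc.collect()  # give Python a chance to free the booster and its device buffers
    with open(path, 'rb') as f:
        model = pickle.load(f)
    os.remove(path)
    return model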

I am on xgboost 1.3.0 and the models are very small. I am running a hyperparameter optimization with Optuna, with a 1000-iteration bootstrapping CV inside each trial. After 50–120 Optuna trials, it throws the error:

xgboost.core.XGBoostError: [16:11:48] ../src/tree/updater_gpu_hist.cu:731: Exception in gpu_hist: NCCL failure :unhandled cuda error ../src/common/device_helpers.cu(71)

and

terminate called after throwing an instance of 'thrust::system::system_error'
  what():  parallel_for failed: out of memory

Looking at nvidia-smi, memory usage stays at a constant ~210 MB (RTX TITAN).

My parameter space looks like this:

params = {
            'booster': 'gbtree',
            'objective': 'binary:logistic',
            'tree_method': 'gpu_hist',
            'random_state': self.random_state,
            'predictor': 'cpu_predictor',
            'n_estimators' : 100,
            'reg_alpha': 0,
            'reg_lambda': 1,
            'min_child_weight': 1,
            'max_depth': trial.suggest_int('max_depth', 2, 6),
            'gamma': trial.suggest_discrete_uniform('gamma', 0, 10, 0.1),
            'learning_rate': trial.suggest_loguniform('learning_rate', 0.005, 0.5),
            'subsample': trial.suggest_discrete_uniform('subsample', 0.3, 1.0, 0.05),
            'colsample_bytree': trial.suggest_discrete_uniform('colsample_bytree', 0.1, 1.0, 0.1)
        }

I thought this was related to issue #4668, but I am not sure about that anymore.

BTW, everything works fine running the same code on CPU. Other libraries like RAPIDS cuML are working fine on GPU.

@trivialfis (Member)

Let me dive into Optuna. In the meantime, could you please share the shape of your data?

nils-fl commented Oct 12, 2020

Thanks for the effort.
The shape varies in the number of features, but all of them fail with the same error. A typical shape would be (133, 20), though.
A detail I forgot to mention: I am using the built-in optuna.integration.XGBoostPruningCallback to prune runs, but it also fails without the callback.

nils-fl commented Oct 26, 2020

Any update on this issue?

@trivialfis (Member)

Could you please provide a more complete script that I can run? I can't guess your configuration.

nils-fl commented Oct 28, 2020

This is my code, which stops after 28 rounds with the errors stated above.

import numpy as np
import pandas as pd
from sklearn.utils import resample
from xgboost import XGBClassifier
import optuna
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X = pd.DataFrame(np.random.randint(0,200, size=(100, 4000)))
y = pd.Series(np.random.randint(0,2, size=(100)))

class StratifiedBootstrapping():
    """
    """
    def __init__(self, n_iter, n_size, random_state=101):
        self.n_iter = n_iter
        self.n_size = n_size
        self.random_state = random_state
    
    def get_splits(self):
        return self.n_iter

    def get_n_splits(self, X=None, y=None, groups=None):
        return self.n_iter
    
    def split(self, X, y, group=None):
        X = X.reset_index(drop=True)    
        for i in range(self.n_iter):
            train = resample(
                X.index, 
                n_samples=self.n_size, 
                stratify=list(y),
                random_state=self.random_state+i
            )
            test = np.array([x for x in X.index if x not in train])

            yield train, test

def get_auc(model, X, y, cv, callback=None, tqdm_disable=False):
    """
    """
    aucs = []
    for i, (train, test) in enumerate(cv.split(X, y)):
        fitted_model = model.fit(
            X.iloc[train, :],
            y.iloc[train],
            eval_set = [
                (X.iloc[train, :], y.iloc[train]),
                (X.iloc[test, :], y.iloc[test])
                ],
            eval_metric = ['logloss'],
            callbacks = callback,
            early_stopping_rounds=10,
            verbose=False,
        )
        y_test = np.array(y.iloc[test])
        X_test = X.iloc[test, :]
        y_pred_proba = fitted_model.predict_proba(X_test)
        auc = roc_auc_score(
            y_true=y_test, 
            y_score=y_pred_proba[:, 1],
            )
        aucs.append(auc)

    auc_mean = np.mean(aucs)
    auc_std = np.std(aucs)

    return auc_mean, auc_std

def objective(trial):
    """
    """
    model = XGBClassifier()
    params = {
        'booster': 'gbtree',
        'objective': 'binary:logistic',
        'tree_method': 'gpu_hist',
        'random_state': 101,
        'n_estimators' : 500,
        'scale_pos_weight' : 1,
        'min_child_weight': trial.suggest_discrete_uniform('min_child_weight', 0, 10, 0.1),
        'reg_alpha': trial.suggest_discrete_uniform('reg_alpha', 0, 1, 0.05),
        'reg_lambda': trial.suggest_discrete_uniform('reg_lambda', 0, 1, 0.05),
        'max_depth': trial.suggest_int('max_depth', 2, 6),
        'gamma': trial.suggest_discrete_uniform('gamma', 0, 10, 0.1),
        'learning_rate': trial.suggest_loguniform('learning_rate', 0.005, 0.5),
        'subsample': trial.suggest_discrete_uniform('subsample', 0.3, 1.0, 0.05),
        'colsample_bytree': trial.suggest_discrete_uniform('colsample_bytree', 0.1, 1.0, 0.05),
        'colsample_bylevel': trial.suggest_discrete_uniform('colsample_bylevel', 0.1, 1.0, 0.05),
        'colsample_bynode': trial.suggest_discrete_uniform('colsample_bynode', 0.1, 1.0, 0.05),
    }
    cv = StratifiedBootstrapping(1000, 95, 101)
    model.set_params(**params)
    callback = 'validation_1-logloss'
    pruning_callback = [optuna.integration.XGBoostPruningCallback(trial, callback)]
    auc = get_auc(model, X, y, cv, pruning_callback)
    return auc

study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=1000, show_progress_bar=True)

trivialfis self-assigned this Oct 29, 2020

@trivialfis (Member)

Running the script now; I'll see what happens.

@trivialfis (Member)

Are you sure that your script is correct? I'm not familiar with Optuna, but this message seems abnormal:

[W 2020-10-30 03:24:54,521] Trial 0 failed, because the returned value from the objective function cannot be cast to float. Returned value is: (0.5540239229861059, 0.07789749679989079
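
If the intent is to maximize the mean AUC, one adjustment (a sketch; I have not tested it) would be to unpack the tuple returned by get_auc and return only the mean, since the objective has to evaluate to a single float:

def objective(trial):
    # ... same model, params, cv and pruning_callback setup as in the script above ...
    auc_mean, auc_std = get_auc(model, X, y, cv, pruning_callback)
    return auc_mean  # a single float that Optuna can cast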

nils-fl commented Oct 30, 2020

Hm, I never saw that before. I will check and let you know.

nils-fl commented Nov 10, 2020

So, the code above works for me, right up until the stated error.
Anyway, I switched to the learning API and tested everything without Optuna, using much simpler code:

import numpy as np
import pandas as pd
from sklearn.utils import resample
import xgboost as xgb
from tqdm import tqdm

X_train = pd.DataFrame(np.random.randint(0,200, size=(100, 4000)))
y_train = pd.Series(np.random.randint(0,2, size=(100)))

class StratifiedBootstrapping():
    """
    """
    def __init__(self, n_iter, n_size, random_state=101):
        self.n_iter = n_iter
        self.n_size = n_size
        self.random_state = random_state
    
    def get_splits(self):
        return self.n_iter

    def get_n_splits(self, X=None, y=None, groups=None):
        return self.n_iter
    
    def split(self, X, y, group=None):
        X = X.reset_index(drop=True)    
        for i in range(self.n_iter):
            train = resample(
                X.index, 
                n_samples=self.n_size, 
                stratify=list(y),
                random_state=self.random_state+i
            )
            test = np.array([x for x in X.index if x not in train])

            yield train, test
            
cv = StratifiedBootstrapping(n_iter=1000, n_size=95)
folds = [(train,test) for train, test in cv.split(X_train, y_train)]

for i in tqdm(range(1000), total=1000):
    for train, test in folds:
        dtrain = xgb.DMatrix(X_train.iloc[train,:], label=y_train.iloc[train])
        dval = xgb.DMatrix(X_train.iloc[test,:], label=y_train.iloc[test])
        params = {
            'verbosity': 0,
            'seed': 101+i,
            'tree_method': 'gpu_hist',
            'objective': 'binary:logistic',
            'booster': 'gbtree',
            'eval_metric': 'auc'
        }
        xgb.train(
            dtrain=dtrain,
            params=params,
            evals=[(dtrain, 'train'), (dval, 'val')],
            num_boost_round=1000,
            verbose_eval=False,
            early_stopping_rounds=50
        )

Still giving me the same error (in round i=31):

Traceback (most recent call last):
  File "xgb_oom.py", line 58, in <module>
    early_stopping_rounds=50
  File "/home/nilsflaschel/miniconda3/envs/rad-pro/lib/python3.7/site-packages/xgboost/training.py", line 212, in train
    xgb_model=xgb_model, callbacks=callbacks)
  File "/home/nilsflaschel/miniconda3/envs/rad-pro/lib/python3.7/site-packages/xgboost/training.py", line 75, in _train_internal
    bst.update(dtrain, i, obj)
  File "/home/nilsflaschel/miniconda3/envs/rad-pro/lib/python3.7/site-packages/xgboost/core.py", line 1161, in update
    dtrain.handle))
  File "/home/nilsflaschel/miniconda3/envs/rad-pro/lib/python3.7/site-packages/xgboost/core.py", line 188, in _check_call
    raise XGBoostError(py_str(_LIB.XGBGetLastError()))
xgboost.core.XGBoostError: [14:55:26] ../src/tree/updater_gpu_hist.cu:723: Exception in gpu_hist: parallel_for failed: out of memory

Stack trace:
  [bt] (0) /home/nilsflaschel/miniconda3/envs/rad-pro/lib/python3.7/site-packages/xgboost/lib/libxgboost.so(+0x8b514) [0x7f8039677514]
  [bt] (1) /home/nilsflaschel/miniconda3/envs/rad-pro/lib/python3.7/site-packages/xgboost/lib/libxgboost.so(+0x43a55a) [0x7f8039a2655a]
  [bt] (2) /home/nilsflaschel/miniconda3/envs/rad-pro/lib/python3.7/site-packages/xgboost/lib/libxgboost.so(+0x16329b) [0x7f803974f29b]
  [bt] (3) /home/nilsflaschel/miniconda3/envs/rad-pro/lib/python3.7/site-packages/xgboost/lib/libxgboost.so(+0x1656c7) [0x7f80397516c7]
  [bt] (4) /home/nilsflaschel/miniconda3/envs/rad-pro/lib/python3.7/site-packages/xgboost/lib/libxgboost.so(+0x190799) [0x7f803977c799]
  [bt] (5) /home/nilsflaschel/miniconda3/envs/rad-pro/lib/python3.7/site-packages/xgboost/lib/libxgboost.so(XGBoosterUpdateOneIter+0x55) [0x7f8039669685]
  [bt] (6) /home/nilsflaschel/miniconda3/envs/rad-pro/lib/python3.7/lib-dynload/../../libffi.so.7(+0x69dd) [0x7f8218a6f9dd]
  [bt] (7) /home/nilsflaschel/miniconda3/envs/rad-pro/lib/python3.7/lib-dynload/../../libffi.so.7(+0x6067) [0x7f8218a6f067]
  [bt] (8) /home/nilsflaschel/miniconda3/envs/rad-pro/lib/python3.7/lib-dynload/_ctypes.cpython-37m-x86_64-linux-gnu.so(_ctypes_callproc+0x2e7) [0x7f81db150517]


terminate called after throwing an instance of 'thrust::system::system_error'
  what():  parallel_for failed: out of memory
[1]    117335 abort (core dumped)  python xgb_oom.py

@trivialfis (Member)

I was able to get a different error after 8 hours ...

nils-fl commented Nov 11, 2020

Do you think it could be hardware related?

@trivialfis (Member)

I can't make any guess at the moment.

@akshayb7

I have been facing similar issues when running xgb.cv with Optuna. On closer inspection, I saw that this was because the GPU was running out of memory. This was confirmed when I lowered the CV to 2 folds (which fits two instances of my data on my GPU) and removed the n_jobs=-1 flag (no parallelization): it then ran without issues. So most probably Optuna is trying to train multiple models in parallel and the GPU runs out of memory.
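
A rough sketch of that kind of setup, with sequential trials (Optuna's n_jobs left at its default of 1) and only two folds per xgb.cv call, so at most one trial's models sit on the GPU at a time; the data and parameter ranges here are purely illustrative:

import numpy as np
import optuna
import xgboost as xgb

# illustrative data; replace with the real training set
X = np.random.rand(1000, 50)
y = np.random.randint(0, 2, size=1000)
dtrain = xgb.DMatrix(X, label=y)

def objective(trial):
    params = {
        'objective': 'binary:logistic',
        'tree_method': 'gpu_hist',
        'eval_metric': 'auc',
        'max_depth': trial.suggest_int('max_depth', 2, 6),
    }
    # nfold=2 keeps the number of models held in GPU memory per trial small
    result = xgb.cv(params, dtrain, num_boost_round=100, nfold=2,
                    early_stopping_rounds=10, seed=101)
    return result['test-auc-mean'].iloc[-1]

study = optuna.create_study(direction='maximize')
# n_jobs=1 (the default) runs trials sequentially instead of training models in parallel
study.optimize(objective, n_trials=100, n_jobs=1)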

@WatsonCao

I am running into the same issue @akshayb7 describes above.

@zhongshuai-cao

I have a similar issue using sklearn's RFE and hyperopt when running multiple iterations of GPU model training. Is there any way to make the sklearn API garbage-collect the GPU memory?
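
For reference, a minimal sketch of the explicit-cleanup pattern discussed earlier in this thread, adapted to the sklearn wrapper: drop the fitted estimator and force a garbage-collection pass between fits. Whether this actually returns the device memory may depend on the XGBoost version; the helper is illustrative only.

import gc
from xgboost import XGBClassifier

def fit_score_release(params, X_train, y_train, X_test, y_test):
    # hypothetical helper: train one GPU model, score it, then release it explicitly
    model = XGBClassifier(**params)  # params would include tree_method='gpu_hist'
    model.fit(X_train, y_train)
    score = model.score(X_test, y_test)
    del model     # drop the reference to the underlying booster
    gc.collect()  # collect it so its GPU buffers can be freed before the next fit
    return score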
