
[python] Crash in training with early stopping inside sklearn Pipeline with dimensionality reduction #2012

Closed
mlisovyi opened this issue Feb 13, 2019 · 5 comments

@mlisovyi
Contributor

I've faced an error when I use a LightGBM model (sklearn API) in a sklearn Pipeline. This happens only when:

  • there is dimensionality reduction among the transforms, e.g. PCA, so the number of features fed into the model is not the same as the number of features in the input to the pipeline's fit;
  • one uses pandas DataFrames (it works well with numpy arrays);
  • one trains the model with early stopping and an evaluation metric (it works well without evaluation).

The last two restrictions are illustrated in the example below.

Environment info

Operating System: Ubuntu 18.04

C++/Python/R version: 3.5.5

LightGBM version or commit hash: 2.2.0

Sklearn version: 0.19.1

Error message

~/xxx/lightgbm/engine.py in train(params, train_set, num_boost_round, valid_sets, valid_names, fobj, feval, init_model, feature_name, categorical_feature, early_stopping_rounds, evals_result, verbose_eval, learning_rates, keep_training_booster, callbacks)
    189             booster.set_train_data_name(train_data_name)
    190         for valid_set, name_valid_set in zip(reduced_valid_sets, name_valid_sets):
--> 191             booster.add_valid(valid_set, name_valid_set)
    192     finally:
    193         train_set._reverse_update_params()

~/xxx/lightgbm/basic.py in add_valid(self, data, name)
   1646         _safe_call(_LIB.LGBM_BoosterAddValidData(
   1647             self.handle,
-> 1648             data.construct().handle))
   1649         self.valid_sets.append(data)
   1650         self.name_valid_sets.append(name)

~/xxx/lightgbm/basic.py in construct(self)
    932                     self._lazy_init(self.data, label=self.label, reference=self.reference,
    933                                     weight=self.weight, group=self.group, init_score=self.init_score, predictor=self._predictor,
--> 934                                     silent=self.silent, feature_name=self.feature_name, params=self.params)
    935                 else:
    936                     # construct subset

~/xxx/lightgbm/basic.py in _lazy_init(self, data, label, reference, weight, group, init_score, predictor, silent, feature_name, categorical_feature, params)
    791             raise TypeError('wrong predictor type {}'.format(type(self.predictor).__name__))
    792         # set feature names
--> 793         return self.set_feature_name(feature_name)
    794 
    795     def __init_from_np2d(self, mat, params_str, ref_dataset):

~/xxx/lightgbm/basic.py in set_feature_name(self, feature_name)
   1218         if self.handle is not None and feature_name is not None and feature_name != 'auto':
   1219             if len(feature_name) != self.num_feature():
-> 1220                 raise ValueError("Length of feature_name({}) and num_feature({}) don't match".format(len(feature_name), self.num_feature()))
   1221             c_feature_name = [c_str(name) for name in feature_name]
   1222             _safe_call(_LIB.LGBM_DatasetSetFeatureNames(
ValueError: Length of feature_name(100) and num_feature(20) don't match

Reproducible examples

import numpy as np
import pandas as pd

from sklearn.pipeline import Pipeline
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error

import lightgbm as lgb

# generate random data 

np.random.seed(312)
# training
n_trn = 1000
n_prm = 100
train_X = np.random.random((n_trn,n_prm))
train_y = np.random.randn(n_trn)
# early-stopping
n_val = 1000
test_X = np.random.random((n_val,n_prm))
test_y = np.random.randn(n_val)

# pipeline to be trained
pipe = Pipeline([('ss', StandardScaler()),
                 ('pca', PCA(20)),
                 ('lgbm', lgb.LGBMRegressor(max_depth=-1, random_state=314, silent=True, metric='None',
                                            num_leaves=20, n_estimators=50, learning_rate=0.15,
                                            importance_type='gain'))])
# fit parameters
params_fit = {"lgbm__early_stopping_rounds":5, 
              "lgbm__eval_metric" : 'rmse',
              'lgbm__eval_names': ['train', 'early_stop'],
              'lgbm__verbose': 5,
              'lgbm__eval_set': [(train_X,train_y), (test_X,test_y)]
              #'lgbm__eval_set': [(pd.DataFrame(train_X),pd.DataFrame(train_y)), (pd.DataFrame(test_X),pd.DataFrame(test_y))]
             }

# THIS WORKS
pipe = pipe.fit(train_X, train_y, **params_fit)
# this also works, which indicates that the issue is not with the training, but rather with the evaluation
pipe = pipe.fit(pd.DataFrame(train_X), pd.DataFrame(train_y))
# THIS DOES NOT WORK
params_fit['lgbm__eval_set'] = [(pd.DataFrame(train_X),pd.DataFrame(train_y)), (pd.DataFrame(test_X),pd.DataFrame(test_y))]
pipe = pipe.fit(pd.DataFrame(train_X), pd.DataFrame(train_y), **params_fit)
@StrikerRUS
Collaborator

Hi @mlisovyi !

It's quite expected because scikit-learn's Pipeline is not aware of any validation data you are providing to LightGBM. So, you train on 20 features and validate on 100.

It's quite strange that everything is OK in the numpy case...
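
A plausible explanation, judging from the condition visible in set_feature_name in the traceback above: numpy arrays carry no column names, so feature_name stays 'auto' and the length check is skipped entirely (training then proceeds silently with mismatched features). Below is a minimal sketch of the two shapes that collide in the pandas case, reusing train_X and test_X from the reproducible example (an illustration, not code from this thread):

from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# what the regressor inside the pipeline actually sees after PCA:
# a plain numpy array with 20 unnamed columns
X_inner = PCA(20).fit_transform(StandardScaler().fit_transform(train_X))
print(X_inner.shape)               # (1000, 20)  -> num_feature = 20

# what eval_set carries: a DataFrame whose 100 named columns
# LightGBM picks up as feature_name
print(pd.DataFrame(test_X).shape)  # (1000, 100) -> feature_name of length 100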

I think you can fix this by something like the following:

# the previously failing case, fixed by passing eval_set through the
# (re-)fitted preprocessing steps
params_fit['lgbm__eval_set'] = [
    (pipe.steps[1][1].fit_transform(pipe.steps[0][1].fit_transform(pd.DataFrame(train_X))),
     pd.DataFrame(train_y)),
    (pipe.steps[1][1].fit_transform(pipe.steps[0][1].fit_transform(pd.DataFrame(test_X))),
     pd.DataFrame(test_y))]
pipe = pipe.fit(pd.DataFrame(train_X), pd.DataFrame(train_y), **params_fit)

@StrikerRUS
Collaborator

@mlisovyi did preprocessing the validation data help in your case?

I think that, since there is no crash in the numpy case, this issue can be transferred as a subissue to #812.

@mlisovyi
Contributor Author

I will have time only on the upcoming weekend to look into it :(

@StrikerRUS
Collaborator

Sure, no problem!

@mlisovyi
Contributor Author

mlisovyi commented Feb 28, 2019

Your proposal works, but it has the downside that the transformations applied to the validation and training samples are different, because the transformers are re-fitted on each eval set (and it also does not extend to a pipeline of arbitrary length). I think the best way to overcome this is:

# transformer-only pipeline: all steps except the final regressor,
# fitted once on the training data and reused for the eval_set
pipe_trf = Pipeline(pipe.steps[:-1])
pipe_trf = pipe_trf.fit(pd.DataFrame(train_X))
params_fit['lgbm__eval_set'] = [(pipe_trf.transform(pd.DataFrame(train_X)),
                                 pd.DataFrame(train_y)),
                                (pipe_trf.transform(pd.DataFrame(test_X)),
                                 pd.DataFrame(test_y))]

pipe = pipe.fit(pd.DataFrame(train_X), pd.DataFrame(train_y), **params_fit)

This yields the same results if I use numpy objects with the same code, as expected. But the result is different from the suspicious single-pipeline approach on arrays, so that calculation does something wrong (and does not alert the user about it 😞).

The downside is that the transforms have to be fitted twice, which can be inefficient for CPU-intensive methods like PCA. I think there is a way to freeze transformers in a sklearn pipeline and thus avoid the duplicated CPU time, or maybe I'm confusing sklearn with some other toolkit. Otherwise, one should either use lightgbm outside of a pipeline if one wants early stopping, or compromise on CPU time.

Note: to get reproducible results, one should fix the random_state in the PCA() initialisation, which I forgot to do in the original example.
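
For reference, a minimal sketch of how the double fit could be avoided, assuming the "freezing" recalled above refers to sklearn's transformer caching (the memory parameter of Pipeline, available since sklearn 0.19); random_state is fixed as per the note:

from shutil import rmtree
from tempfile import mkdtemp

cachedir = mkdtemp()
# both pipelines share one cache, so StandardScaler and PCA fitted on
# train_X by pipe_trf should be reused, not re-fitted, when pipe is fitted
pipe = Pipeline([('ss', StandardScaler()),
                 ('pca', PCA(20, random_state=314)),
                 ('lgbm', lgb.LGBMRegressor(max_depth=-1, random_state=314, silent=True,
                                            metric='None', num_leaves=20, n_estimators=50,
                                            learning_rate=0.15, importance_type='gain'))],
                memory=cachedir)
pipe_trf = Pipeline(pipe.steps[:-1], memory=cachedir)
pipe_trf = pipe_trf.fit(pd.DataFrame(train_X))
params_fit['lgbm__eval_set'] = [(pipe_trf.transform(pd.DataFrame(train_X)), pd.DataFrame(train_y)),
                                (pipe_trf.transform(pd.DataFrame(test_X)), pd.DataFrame(test_y))]
pipe = pipe.fit(pd.DataFrame(train_X), pd.DataFrame(train_y), **params_fit)
rmtree(cachedir)  # clean up the on-disk cache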

lock bot locked as resolved and limited conversation to collaborators Mar 11, 2020