
[python] Crash in training with early stopping inside sklearn Pipeline with dimensionality reduction #2012

Closed
mlisovyi opened this issue Feb 13, 2019 · 5 comments

@mlisovyi
Contributor

I've faced an error when I use a LightGBM model (sklearn API) in a sklearn Pipeline. This happens only when:

  • there is dimensionality reduction among the transforms, e.g. PCA, so the number of features fed into the model is not the same as the number of features in the input to the pipeline's fit;
  • one uses pandas DataFrames (it works well with numpy arrays);
  • one trains the model with early stopping and an evaluation metric (it works well without evaluation).

The last two restrictions are illustrated in the example below.

Environment info

Operating System: Ubuntu 18.04

C++/Python/R version: 3.5.5

LightGBM version or commit hash: 2.2.0

Sklearn version: 0.19.1

Error message

~/xxx/lightgbm/engine.py in train(params, train_set, num_boost_round, valid_sets, valid_names, fobj, feval, init_model, feature_name, categorical_feature, early_stopping_rounds, evals_result, verbose_eval, learning_rates, keep_training_booster, callbacks)
    189             booster.set_train_data_name(train_data_name)
    190         for valid_set, name_valid_set in zip(reduced_valid_sets, name_valid_sets):
--> 191             booster.add_valid(valid_set, name_valid_set)
    192     finally:
    193         train_set._reverse_update_params()

~/xxx/lightgbm/basic.py in add_valid(self, data, name)
   1646         _safe_call(_LIB.LGBM_BoosterAddValidData(
   1647             self.handle,
-> 1648             data.construct().handle))
   1649         self.valid_sets.append(data)
   1650         self.name_valid_sets.append(name)

~/xxx/lightgbm/basic.py in construct(self)
    932                     self._lazy_init(self.data, label=self.label, reference=self.reference,
    933                                     weight=self.weight, group=self.group, init_score=self.init_score, predictor=self._predictor,
--> 934                                     silent=self.silent, feature_name=self.feature_name, params=self.params)
    935                 else:
    936                     # construct subset

~/xxx/lightgbm/basic.py in _lazy_init(self, data, label, reference, weight, group, init_score, predictor, silent, feature_name, categorical_feature, params)
    791             raise TypeError('wrong predictor type {}'.format(type(self.predictor).__name__))
    792         # set feature names
--> 793         return self.set_feature_name(feature_name)
    794 
    795     def __init_from_np2d(self, mat, params_str, ref_dataset):

~/xxx/lightgbm/basic.py in set_feature_name(self, feature_name)
   1218         if self.handle is not None and feature_name is not None and feature_name != 'auto':
   1219             if len(feature_name) != self.num_feature():
-> 1220                 raise ValueError("Length of feature_name({}) and num_feature({}) don't match".format(len(feature_name), self.num_feature()))
   1221             c_feature_name = [c_str(name) for name in feature_name]
   1222             _safe_call(_LIB.LGBM_DatasetSetFeatureNames(
ValueError: Length of feature_name(100) and num_feature(20) don't match

Reproducible examples

import numpy as np
import pandas as pd

from sklearn.pipeline import Pipeline
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error

import lightgbm as lgb

# generate random data 

np.random.seed(312)
# training
n_trn = 1000
n_prm = 100
train_X = np.random.random((n_trn,n_prm))
train_y = np.random.randn(n_trn)
# early-stopping
n_val = 1000
test_X = np.random.random((n_val,n_prm))
test_y = np.random.randn(n_val)

# pipeline to be trained
pipe = Pipeline([('ss', StandardScaler()),
                 ('pca', PCA(20)),
                 ('lgbm', lgb.LGBMRegressor(max_depth=-1, random_state=314, silent=True, metric='None',
                                            num_leaves=20, n_estimators=50, learning_rate=0.15,
                                            importance_type='gain'))])
# fit parameters
params_fit = {"lgbm__early_stopping_rounds":5, 
              "lgbm__eval_metric" : 'rmse',
              'lgbm__eval_names': ['train', 'early_stop'],
              'lgbm__verbose': 5,
              'lgbm__eval_set': [(train_X,train_y), (test_X,test_y)]
              #'lgbm__eval_set': [(pd.DataFrame(train_X),pd.DataFrame(train_y)), (pd.DataFrame(test_X),pd.DataFrame(test_y))]
             }

# THIS WORKS
pipe = pipe.fit(train_X, train_y, **params_fit)
# this also works, which indicates that the issue is not with the training, but rather with the evaluation
pipe = pipe.fit(pd.DataFrame(train_X), pd.DataFrame(train_y))
# THIS DOES NOT WORK
params_fit['lgbm__eval_set'] = [(pd.DataFrame(train_X),pd.DataFrame(train_y)), (pd.DataFrame(test_X),pd.DataFrame(test_y))]
pipe = pipe.fit(pd.DataFrame(train_X), pd.DataFrame(train_y), **params_fit)
@StrikerRUS
Collaborator

Hi @mlisovyi !

It's quite expected because scikit-learn's Pipeline is not aware of any validation data you are providing to LightGBM. So, you train on 20 features and validate on 100.

It's quite strange that everything is OK in the numpy case...
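
A plausible explanation, judging from the condition visible in set_feature_name in the traceback above: numpy arrays carry no column names, so feature_name stays 'auto' and the length check is skipped entirely (training then proceeds silently with mismatched features). Below is a minimal sketch of the two shapes that collide in the pandas case, reusing train_X and test_X from the reproducible example (an illustration, not code from this thread):

from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# what the regressor inside the pipeline actually sees after PCA:
# a plain numpy array with 20 unnamed columns
X_inner = PCA(20).fit_transform(StandardScaler().fit_transform(train_X))
print(X_inner.shape)               # (1000, 20)  -> num_feature = 20

# what eval_set carries: a DataFrame whose 100 named columns
# LightGBM picks up as feature_name
print(pd.DataFrame(test_X).shape)  # (1000, 100) -> feature_name of length 100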

I think you can fix this by something like the following:

# the previously failing case, fixed by passing eval_set through the
# (re-)fitted preprocessing steps
params_fit['lgbm__eval_set'] = [
    (pipe.steps[1][1].fit_transform(pipe.steps[0][1].fit_transform(pd.DataFrame(train_X))),
     pd.DataFrame(train_y)),
    (pipe.steps[1][1].fit_transform(pipe.steps[0][1].fit_transform(pd.DataFrame(test_X))),
     pd.DataFrame(test_y))]
pipe = pipe.fit(pd.DataFrame(train_X), pd.DataFrame(train_y), **params_fit)

@StrikerRUS
Collaborator

@mlisovyi did preprocessing the validation data help in your case?

I think that, since there is no crash in the numpy case, this issue can be transferred as a subissue to #812.

@mlisovyi
Contributor Author

I will have time only on the upcoming weekend to look into it :(

@StrikerRUS
Collaborator

Sure, no problem!

@mlisovyi
Contributor Author

mlisovyi commented Feb 28, 2019

Your proposal works, but it has the downside that the transformations applied to the validation and training samples are different, because the transformers are re-fitted on each eval set (and it also does not extend to a pipeline of arbitrary length). I think the best way to overcome this is:

# transformer-only pipeline: all steps except the final regressor,
# fitted once on the training data and reused for the eval_set
pipe_trf = Pipeline(pipe.steps[:-1])
pipe_trf = pipe_trf.fit(pd.DataFrame(train_X))
params_fit['lgbm__eval_set'] = [(pipe_trf.transform(pd.DataFrame(train_X)),
                                 pd.DataFrame(train_y)),
                                (pipe_trf.transform(pd.DataFrame(test_X)),
                                 pd.DataFrame(test_y))]

pipe = pipe.fit(pd.DataFrame(train_X), pd.DataFrame(train_y), **params_fit)

This yields the same results if I use numpy objects with the same code, as expected. But the result is different from the suspicious single-pipeline approach on arrays, so that calculation does something wrong (and does not alert the user about it 😞).

The downside is that the transforms have to be fitted twice, which can be inefficient for CPU-intensive methods like PCA. I think there is a way to freeze transformers in a sklearn pipeline and thus avoid the duplicated CPU time, or maybe I'm confusing sklearn with some other toolkit. Otherwise, one should either use lightgbm outside of a pipeline if one wants early stopping, or compromise on CPU time.

Note: to get reproducible results, one should fix the random_state in the PCA() initialisation, which I forgot to do in the original example.
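
For reference, a minimal sketch of how the double fit could be avoided, assuming the "freezing" recalled above refers to sklearn's transformer caching (the memory parameter of Pipeline, available since sklearn 0.19); random_state is fixed as per the note:

from shutil import rmtree
from tempfile import mkdtemp

cachedir = mkdtemp()
# both pipelines share one cache, so StandardScaler and PCA fitted on
# train_X by pipe_trf should be reused, not re-fitted, when pipe is fitted
pipe = Pipeline([('ss', StandardScaler()),
                 ('pca', PCA(20, random_state=314)),
                 ('lgbm', lgb.LGBMRegressor(max_depth=-1, random_state=314, silent=True,
                                            metric='None', num_leaves=20, n_estimators=50,
                                            learning_rate=0.15, importance_type='gain'))],
                memory=cachedir)
pipe_trf = Pipeline(pipe.steps[:-1], memory=cachedir)
pipe_trf = pipe_trf.fit(pd.DataFrame(train_X))
params_fit['lgbm__eval_set'] = [(pipe_trf.transform(pd.DataFrame(train_X)), pd.DataFrame(train_y)),
                                (pipe_trf.transform(pd.DataFrame(test_X)), pd.DataFrame(test_y))]
pipe = pipe.fit(pd.DataFrame(train_X), pd.DataFrame(train_y), **params_fit)
rmtree(cachedir)  # clean up the on-disk cache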

lock bot locked as resolved and limited conversation to collaborators Mar 11, 2020