Enable random seed averaging #40

Open · Y-oHr-N opened this issue Jan 30, 2020 · 9 comments
Labels: help wanted

Y-oHr-N (Owner) commented Jan 30, 2020

No description provided.

Y-oHr-N added the help wanted label Feb 1, 2020
flamby commented Feb 4, 2020

Hi @Y-oHr-N

Do you want to make your mllib.ensemble.RandomSeedAveragingRegressor and mllib.ensemble.RandomSeedAveragingClassifier somehow compatible with OptGBM?

I would have thought that since OptGBM follows the sklearn API, it would be compatible by default.

Or am I missing something?

Anyway, random seed averaging is something I'll need to test, since the standard deviation across differently seeded models is sometimes very high, and averaging could produce a more robust model (see the sketch below).
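A minimal sketch of that spread check, assuming a toy dataset and an arbitrary number of seeds:

import lightgbm as lgb
import numpy as np

from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Fit the same model under several seeds and measure the score spread.
scores = []

for seed in range(10):
    model = lgb.LGBMClassifier(random_state=seed)

    model.fit(X_train, y_train)
    scores.append(model.score(X_test, y_test))

print('mean acc = {:.3f}, std = {:.3f}'.format(np.mean(scores), np.std(scores)))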

Do you have datasets in mind on which it improves results?

Thanks

Y-oHr-N (Owner) commented Feb 5, 2020

Hi @flamby,

As you said, random seed averaging works by default.

The first way is to pass an OGBMModel to RandomSeedAveragingModel. This takes a very long time, because hyperparameter tuning is run once per seed.

The second way is to train an OGBMModel and then pass an LGBMModel configured with its best hyperparameters to RandomSeedAveragingModel. This requires tuning only once, but does not guarantee that the best model is trained for each seed.

Here is a simple example using OptGBM 0.5.0:

import lightgbm as lgb

from mllib.ensemble import RandomSeedAveragingClassifier
from optgbm.sklearn import OGBMClassifier
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# 1. LightGBM
model = lgb.LGBMClassifier(random_state=0)

model.fit(X_train, y_train)

score = model.score(X_test, y_test)  # acc = 0.960...

# 2. LightGBM + random seed averaging
model = lgb.LGBMClassifier()
model = RandomSeedAveragingClassifier(model, n_estimators=10, random_state=0)

model.fit(X_train, y_train)

score = model.score(X_test, y_test)  # acc = 0.960...

# 3. OptGBM (fold averaging)
model = OGBMClassifier(n_trials=20, random_state=0)

model.fit(X_train, y_train)

score = model.score(X_test, y_test)  # acc = 0.977...

# 4. OptGBM (single model)
model = OGBMClassifier(n_trials=20, random_state=0)

model.fit(X_train, y_train)
model.refit(X_train, y_train)

score = model.score(X_test, y_test)  # acc = 0.980...

# 5. OptGBM (fold averaging) + random seed averaging (tune `n_estimators` times)
model = OGBMClassifier(n_trials=20)
model = RandomSeedAveragingClassifier(model, n_estimators=10, random_state=0)

model.fit(X_train, y_train)

score = model.score(X_test, y_test)  # acc = 0.984...

# 6. OptGBM (fold averaging) + random seed averaging (tune only once)
model = OGBMClassifier(n_trials=20, random_state=0)

model.fit(X_train, y_train)

model = lgb.LGBMClassifier(**model.study_.best_params)
model = RandomSeedAveragingClassifier(model, n_estimators=10, random_state=0)

model.fit(X_train, y_train)

score = model.score(X_test, y_test)  # acc = 0.968...

By the way, mllib is no longer maintained and most of its code has been ported to pretools. I am trying to implement random seed averaging in pretools or in OGBMModel.refit (a rough sketch of the idea follows).
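A rough sketch of the refit-based idea, using a hypothetical helper (this is not the pretools or OptGBM API): refit one LGBMClassifier per seed with the tuned hyperparameters, then average the predicted probabilities.

import numpy as np

import lightgbm as lgb

def refit_with_seed_averaging(best_params, best_iteration, X, y, n_seeds=10):
    # Hypothetical helper: refit one LGBMClassifier per seed with the tuned
    # hyperparameters (assumes best_params contains neither n_estimators nor
    # random_state) and return a function averaging their probabilities.
    estimators = []

    for seed in range(n_seeds):
        est = lgb.LGBMClassifier(
            n_estimators=best_iteration, random_state=seed, **best_params
        )

        est.fit(X, y)
        estimators.append(est)

    def predict_proba(X_new):
        return np.mean([e.predict_proba(X_new) for e in estimators], axis=0)

    return predict_proba

# Hypothetical usage after `model = OGBMClassifier(...).fit(X_train, y_train)`:
# predict = refit_with_seed_averaging(
#     model.best_params_, model.best_iteration_, X_train, y_train
# )
# probas = predict(X_test)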

flamby commented Feb 6, 2020

Hi @Y-oHr-N,

Thanks for the clarification.
I ran all your examples on my dataset (OptGBM 0.5.0 and mllib from the current git master branch) and indeed saw small improvements. I need to test it more extensively.

Except for example 5, which gave me the error below:

lib/python3.7/site-packages/lightgbm/sklearn.py in set_params(self, **params)
    366             setattr(self, key, value)
    367             if hasattr(self, '_' + key):
--> 368                 setattr(self, '_' + key, value)
    369             self._other_params[key] = value
    370         return self

AttributeError: can't set attribute

I rely heavily on the decision threshold to improve my classification precision/recall, thanks to the predict_proba method (see the sketch below). If it makes sense and is doable, do you plan to implement this method in RandomSeedAveraging{Classifier,Regressor}?
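A minimal sketch of the threshold trick, assuming a binary task; the 0.3 threshold and the dataset are placeholders:

import lightgbm as lgb

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = lgb.LGBMClassifier(random_state=0)

model.fit(X_train, y_train)

# Trade precision for recall by moving the threshold away from the default 0.5.
probas = model.predict_proba(X_test)
y_pred = (probas[:, 1] >= 0.3).astype(int)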

Thanks, and keep up the good work!

Y-oHr-N (Owner) commented Feb 6, 2020

Except for example 5, which gave me the error below:

I noticed the bug yesterday, fixed it immediately, and released 0.5.0. If you are really using 0.5.0, please tell me your environment in detail; example 5 works fine in my environment.

If it makes sense and is doable, do you plan to implement this method in RandomSeedAveraging{Classifier,Regressor}?

I will seriously consider implementing it, but I cannot guarantee that it will land soon. I would be glad if you could wait patiently or send a PR.

Thank you for your feedback.

flamby commented Feb 6, 2020

I noticed the bug yesterday, fixed it immediately, and released 0.5.0. If you are really using 0.5.0, please tell me your environment in detail; example 5 works fine in my environment.

You're right. It turned out to be a Jupyter cache issue. Silly me. Restarting the kernel fixed it.

I will seriously consider implementing it, but I cannot guarantee that it will land soon. I would be glad if you could wait patiently or send a PR.

I've already monkey-patched it, taking inspiration from the way sklearn's VotingClassifier does it.

Here it is. I hope I did not make any mistakes.

import warnings

import lightgbm as lgb
import numpy as np

def predict_proba(self, X):
    self._check_is_fitted()
    # Collect the class probabilities predicted by each seeded estimator.
    probas = np.asarray([e.predict_proba(X) for e in self.estimators_])
    # Average them; suppress RuntimeWarnings while averaging.
    with warnings.catch_warnings():
        warnings.simplefilter('ignore', category=RuntimeWarning)
        avg = np.average(probas, axis=0)
    return avg

# monkey patching
RandomSeedAveragingClassifier.predict_proba = predict_proba

model = lgb.LGBMClassifier()
model = RandomSeedAveragingClassifier(model, n_estimators=10, random_state=0)
model.fit(X_train, y_train)
probas = model.predict_proba(X_test)

Y-oHr-N (Owner) commented Feb 10, 2020

Thank you for sharing your code.
I implemented it in pretools and released the package to PyPI.
Please try it.

flamby commented Feb 10, 2020

Thank you very much @Y-oHr-N
I'll test it in the coming days.

Y-oHr-N (Owner) commented Feb 14, 2020

Hi @flamby,

I noticed that example 6 had a mistake: best_params_ does not include n_estimators, so it has to be taken from best_iteration_. The corrected code is as follows.

# 6. OptGBM (fold averaging) + random seed averaging (tune only once)
model = OGBMClassifier(n_trials=20, random_state=0)

model.fit(X_train, y_train)

model = lgb.LGBMClassifier(n_estimators=model.best_iteration_, **model.best_params_)
model = RandomSeedAveragingClassifier(model, n_estimators=10, random_state=0)

model.fit(X_train, y_train)

score = model.score(X_test, y_test)  # acc = 0.982...

flamby commented Feb 15, 2020

Hi @Y-oHr-N

Thanks. I finally had time to test it, and it works like a charm.
