Enable random seed averaging #40

Open · Y-oHr-N opened this issue Jan 30, 2020 · 9 comments
Labels: help wanted

Y-oHr-N (Owner) commented Jan 30, 2020

No description provided.

Y-oHr-N added the help wanted label Feb 1, 2020
flamby commented Feb 4, 2020

Hi @Y-oHr-N

Do you want to make your mllib.ensemble.RandomSeedAveragingRegressor and mllib.ensemble.RandomSeedAveragingClassifier somehow compatible with OptGBM?

I would have thought that since OptGBM follows the sklearn API, it would be compatible by default.

Or am I missing something?

Anyway, random seed averaging is something I'll need to test, since the standard deviation across differently seeded models is sometimes very high, and averaging could produce a more robust model (see the sketch below).
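A minimal sketch of that spread check, assuming a toy dataset and an arbitrary number of seeds:

import lightgbm as lgb
import numpy as np

from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Fit the same model under several seeds and measure the score spread.
scores = []

for seed in range(10):
    model = lgb.LGBMClassifier(random_state=seed)

    model.fit(X_train, y_train)
    scores.append(model.score(X_test, y_test))

print('mean acc = {:.3f}, std = {:.3f}'.format(np.mean(scores), np.std(scores)))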

Do you have datasets in mind on which it improves results?

Thanks

Y-oHr-N (Owner) commented Feb 5, 2020

Hi @flamby,

As you said, random seed averaging works by default.

The first way is to pass an OGBMModel to RandomSeedAveragingModel. This takes a very long time, because hyperparameter tuning is run once per seed.

The second way is to train an OGBMModel and then pass an LGBMModel configured with its best hyperparameters to RandomSeedAveragingModel. This requires tuning only once, but does not guarantee that the best model is trained for each seed.

Here is a simple example using OptGBM 0.5.0:

import lightgbm as lgb

from mllib.ensemble import RandomSeedAveragingClassifier
from optgbm.sklearn import OGBMClassifier
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# 1. LightGBM
model = lgb.LGBMClassifier(random_state=0)

model.fit(X_train, y_train)

score = model.score(X_test, y_test)  # acc = 0.960...

# 2. LightGBM + random seed averaging
model = lgb.LGBMClassifier()
model = RandomSeedAveragingClassifier(model, n_estimators=10, random_state=0)

model.fit(X_train, y_train)

score = model.score(X_test, y_test)  # acc = 0.960...

# 3. OptGBM (fold averaging)
model = OGBMClassifier(n_trials=20, random_state=0)

model.fit(X_train, y_train)

score = model.score(X_test, y_test)  # acc = 0.977...

# 4. OptGBM (single model)
model = OGBMClassifier(n_trials=20, random_state=0)

model.fit(X_train, y_train)
model.refit(X_train, y_train)

score = model.score(X_test, y_test)  # acc = 0.980...

# 5. OptGBM (fold averaging) + random seed averaging (tune `n_estimators` times)
model = OGBMClassifier(n_trials=20)
model = RandomSeedAveragingClassifier(model, n_estimators=10, random_state=0)

model.fit(X_train, y_train)

score = model.score(X_test, y_test)  # acc = 0.984...

# 6. OptGBM (fold averaging) + random seed averaging (tune only once)
model = OGBMClassifier(n_trials=20, random_state=0)

model.fit(X_train, y_train)

model = lgb.LGBMClassifier(**model.study_.best_params)
model = RandomSeedAveragingClassifier(model, n_estimators=10, random_state=0)

model.fit(X_train, y_train)

score = model.score(X_test, y_test)  # acc = 0.968...

By the way, mllib is no longer maintained and most of its code has been ported to pretools. I am trying to implement random seed averaging in pretools or in OGBMModel.refit (a rough sketch of the idea follows).
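A rough sketch of the refit-based idea, using a hypothetical helper (this is not the pretools or OptGBM API): refit one LGBMClassifier per seed with the tuned hyperparameters, then average the predicted probabilities.

import numpy as np

import lightgbm as lgb

def refit_with_seed_averaging(best_params, best_iteration, X, y, n_seeds=10):
    # Hypothetical helper: refit one LGBMClassifier per seed with the tuned
    # hyperparameters (assumes best_params contains neither n_estimators nor
    # random_state) and return a function averaging their probabilities.
    estimators = []

    for seed in range(n_seeds):
        est = lgb.LGBMClassifier(
            n_estimators=best_iteration, random_state=seed, **best_params
        )

        est.fit(X, y)
        estimators.append(est)

    def predict_proba(X_new):
        return np.mean([e.predict_proba(X_new) for e in estimators], axis=0)

    return predict_proba

# Hypothetical usage after `model = OGBMClassifier(...).fit(X_train, y_train)`:
# predict = refit_with_seed_averaging(
#     model.best_params_, model.best_iteration_, X_train, y_train
# )
# probas = predict(X_test)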

flamby commented Feb 6, 2020

Hi @Y-oHr-N,

Thanks for the clarification.
I ran all your examples on my dataset (OptGBM 0.5.0 and mllib from the current git master branch) and indeed saw small improvements. I need to test it more extensively.

Except for example 5, which gave me the error below:

lib/python3.7/site-packages/lightgbm/sklearn.py in set_params(self, **params)
    366             setattr(self, key, value)
    367             if hasattr(self, '_' + key):
--> 368                 setattr(self, '_' + key, value)
    369             self._other_params[key] = value
    370         return self

AttributeError: can't set attribute

I rely heavily on the decision threshold to improve my classification precision/recall, thanks to the predict_proba method (see the sketch below). If it makes sense and is doable, do you plan to implement this method in RandomSeedAveraging{Classifier,Regressor}?
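A minimal sketch of the threshold trick, assuming a binary task; the 0.3 threshold and the dataset are placeholders:

import lightgbm as lgb

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = lgb.LGBMClassifier(random_state=0)

model.fit(X_train, y_train)

# Trade precision for recall by moving the threshold away from the default 0.5.
probas = model.predict_proba(X_test)
y_pred = (probas[:, 1] >= 0.3).astype(int)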

Thanks, and keep up the good work!

Y-oHr-N (Owner) commented Feb 6, 2020

Except for example 5, which gave me the error below:

I noticed the bug yesterday, fixed it immediately, and released 0.5.0. If you are really using 0.5.0, please tell me your environment in detail; example 5 works fine in my environment.

If it makes sense and is doable, do you plan to implement this method in RandomSeedAveraging{Classifier,Regressor}?

I will seriously consider implementing it, but I cannot guarantee that it will land soon. I would be glad if you could wait patiently or send a PR.

Thank you for your feedback.

flamby commented Feb 6, 2020

I noticed the bug yesterday, fixed it immediately, and released 0.5.0. If you are really using 0.5.0, please tell me your environment in detail; example 5 works fine in my environment.

You're right. It turned out to be a Jupyter cache issue. Silly me. Restarting the kernel fixed it.

I will seriously consider implementing it, but I cannot guarantee that it will land soon. I would be glad if you could wait patiently or send a PR.

I've already monkey-patched it, taking inspiration from the way sklearn's VotingClassifier does it.

Here it is. I hope I did not make any mistakes.

import warnings

import lightgbm as lgb
import numpy as np

def predict_proba(self, X):
    self._check_is_fitted()
    # Collect the class probabilities predicted by each seeded estimator.
    probas = np.asarray([e.predict_proba(X) for e in self.estimators_])
    # Average them; suppress RuntimeWarnings while averaging.
    with warnings.catch_warnings():
        warnings.simplefilter('ignore', category=RuntimeWarning)
        avg = np.average(probas, axis=0)
    return avg

# monkey patching
RandomSeedAveragingClassifier.predict_proba = predict_proba

model = lgb.LGBMClassifier()
model = RandomSeedAveragingClassifier(model, n_estimators=10, random_state=0)
model.fit(X_train, y_train)
probas = model.predict_proba(X_test)

Y-oHr-N (Owner) commented Feb 10, 2020

Thank you for sharing your code.
I implemented it in pretools and released the package to PyPI.
Please try it.

flamby commented Feb 10, 2020

Thank you very much @Y-oHr-N
I'll test it in the coming days.

Y-oHr-N (Owner) commented Feb 14, 2020

Hi @flamby,

I noticed that example 6 had a mistake: best_params_ does not include n_estimators, so it has to be taken from best_iteration_. The corrected code is as follows.

# 6. OptGBM (fold averaging) + random seed averaging (tune only once)
model = OGBMClassifier(n_trials=20, random_state=0)

model.fit(X_train, y_train)

model = lgb.LGBMClassifier(n_estimators=model.best_iteration_, **model.best_params_)
model = RandomSeedAveragingClassifier(model, n_estimators=10, random_state=0)

model.fit(X_train, y_train)

score = model.score(X_test, y_test)  # acc = 0.982...

flamby commented Feb 15, 2020

Hi @Y-oHr-N

Thanks. I finally had time to test it, and it works like a charm.
