[FEATURE] Variable number of n_components per class in GMMClassifier #608
Comments
Thanks for raising the issue. |
I guess a few things to consider before implementing this.
|
In fairness, our docs could do a better job of explaining the Bayesian variants of the GMM methods. It feels like they are mainly mentioned here. |
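To illustrate why the Bayesian variants are relevant to this request, here is a minimal sketch using scikit-learn's `BayesianGaussianMixture` directly (not the sklego wrapper). It assumes three synthetic blobs and an upper bound of 10 components; the `0.01` weight cut-off and the `weight_concentration_prior` value are arbitrary choices for the illustration, not recommended settings.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.mixture import BayesianGaussianMixture

# Three blobs, but we deliberately allow up to 10 components.
X, _ = make_blobs(n_samples=1500, centers=3, cluster_std=0.7, random_state=0)

bgm = BayesianGaussianMixture(
    n_components=10,                  # upper bound, not the final count
    weight_concentration_prior=1e-2,  # assumption: small value favours few active components
    max_iter=1000,
    random_state=0,
).fit(X)

# Components with a negligible mixing weight are effectively unused.
# The 0.01 cut-off is an arbitrary threshold chosen for this illustration.
effective = int(np.sum(bgm.weights_ > 0.01))
print("weights:", np.round(bgm.weights_, 3))
print("effective components:", effective)
```

The idea is that `n_components` acts as a ceiling: the fitted `weights_` indicate how many components the model actually relies on, so each class could in principle end up with a different effective count without tuning it explicitly.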
Long story short, I think a variable number of Gaussians per class will outperform a shared number when one of the classes is associated with multiple clusters in feature space. Consider the simple implementation below, using some artificial data. Using grid search, I end up with better scores when using a different number of Gaussians for each GMM.
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.datasets import make_classification, make_blobs
from sklearn.mixture import GaussianMixture
from sklearn.model_selection import GridSearchCV
from sklearn.base import BaseEstimator, ClassifierMixin
class GMMClassifier(BaseEstimator, ClassifierMixin):
    def __init__(self, n_components_class0=1, n_components_class1=1, covariance_type='full'):
        self.n_components_class0 = n_components_class0
        self.n_components_class1 = n_components_class1
        self.covariance_type = covariance_type

    def fit(self, X, y):
        # Fit one mixture per class, each with its own number of components
        X_class0 = X[y == 0]
        X_class1 = X[y == 1]
        self.gmm_class0 = GaussianMixture(n_components=self.n_components_class0, covariance_type=self.covariance_type)
        self.gmm_class1 = GaussianMixture(n_components=self.n_components_class1, covariance_type=self.covariance_type)
        self.gmm_class0 = self.gmm_class0.fit(X_class0)
        self.gmm_class1 = self.gmm_class1.fit(X_class1)
        return self  # scikit-learn convention: fit returns the estimator

    def predict(self, X):
        # score_samples returns per-sample log-likelihoods under each class's mixture
        prob_class0 = self.gmm_class0.score_samples(X)
        prob_class1 = self.gmm_class1.score_samples(X)
        return (prob_class1 > prob_class0).astype(int)
# Generate some example data
X, y = make_blobs(n_samples=5000, cluster_std=[0.8, 2, 1], random_state=0)
# Convert to binary classification by combining classes 1 and 2
y = np.where(y == 2, 1, y)
# Define the parameter grid for grid search
param_grid = {
    'n_components_class0': range(1, 11),
    'n_components_class1': range(1, 11),
}
# Create the GMM classifier
gmm_classifier = GMMClassifier()
# Create the pipeline for grid search
grid_search = GridSearchCV(gmm_classifier, param_grid, cv=5)
# Fit the model to the data
grid_search.fit(X, y)
# Print the best parameters and best score
print("Best Parameters:", grid_search.best_params_)
print("Best Score:", grid_search.best_score_)
I end up with the best performance for 4 Gaussians for one class and 3 for the other.
Consider the following:
import time
from sklearn.datasets import make_blobs
from sklego.mixture import GMMClassifier, BayesianGMMClassifier
X, y = make_blobs(n_samples=50000, cluster_std=[0.8, 2, 1], random_state=0)
clf = GMMClassifier(n_components=20, max_iter=1000)
start = time.time()
clf.fit(X, y)
stop = time.time()
print(stop - start)
clf = BayesianGMMClassifier(n_components=20, max_iter=1000)
start = time.time()
clf.fit(X, y)
stop = time.time()
print(stop - start)
The GMMClassifier takes less than a second on my PC, while the BayesianGMMClassifier takes around 40 seconds. |
I ran the exact same script you shared and got these results.
This is likely due to randomness, since the data generation tools in sklearn all use some form of
pd.DataFrame(grid_search.cv_results_)[['mean_test_score', 'param_n_components_class1', 'param_n_components_class0']].sort_values("mean_test_score")
This yields:
I'm certainly still open to the idea, but I would prefer to have a more convincing benchmark, also for the docs. Is there a dataset you can share that would give a clear difference in score?
I guess a final comment on the time it takes to train. The grid search takes a while:
Compared to that grid, the Bayesian method is a fair bit faster.
from sklego.mixture import GMMClassifier, BayesianGMMClassifier
clf = BayesianGMMClassifier(n_components=10, max_iter=1000)
This yields:
|
Given the radio silence, @FBruzzesi would you be OK if we just update the ticket to reflect that the docs should be updated? |
@koaning do you mean to make explicit in the docs that |
Hi @koaning, I'm still looking for a good benchmark dataset. We have some internal medical data that would fit this, but I imagine something that can be generated using sklearn datasets would be preferable. I know that the Bayesian approach is faster than grid-searching, but we often reach a point where we don't have the compute to fit an extremely large Bayesian model on the data that we have (>1,000,000 points, 8+ features). For those datasets, E-M based approaches are the only feasible option. |
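For context on the compute concern, one EM-based way to avoid the joint grid is to pick the number of components for each class independently, for example by BIC. The sketch below is an assumption on my part and not something proposed in this thread: the helper `select_n_components_by_bic` is hypothetical, and BIC measures how well each class density is fit, which may not coincide with the setting that maximizes classification accuracy.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture


def select_n_components_by_bic(X_class, candidates, random_state=0):
    """Hypothetical helper: return the candidate count with the lowest BIC."""
    bics = [
        GaussianMixture(n_components=k, random_state=random_state)
        .fit(X_class)
        .bic(X_class)
        for k in candidates
    ]
    return candidates[int(np.argmin(bics))]


# Same artificial data as the earlier snippet in this thread
X, y = make_blobs(n_samples=5000, cluster_std=[0.8, 2, 1], random_state=0)
y = np.where(y == 2, 1, y)

candidates = list(range(1, 6))
# One independent search per class: len(candidates) fits per class
# instead of len(candidates) ** 2 combinations in a joint grid.
best_per_class = {
    label: select_n_components_by_bic(X[y == label], candidates)
    for label in np.unique(y)
}
print(best_per_class)
```

The cost grows linearly with the number of candidates per class rather than quadratically, at the price of not optimizing the classification score directly.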
@FBruzzesi I made a separate issue for the docs #614 to keep this thread on topic. @timmocking instead of finding a dataset could you share a benchmark on simulated data? I'm really just interested in justifying the addition and also having a story to share in the docs. As long as the default behavior doesn't change it'll be easy to add/support but I do want to be able to teach people when to consider any new setting. |
I hope this example works better than the last one. I adapted the "swiss roll" dataset to create a binary classification example where each class is better approximated with a different number of Gaussian components.
import numpy as np
from sklearn.datasets import make_swiss_roll
import matplotlib.pyplot as plt
X, univar_pos = make_swiss_roll(n_samples=1500, random_state=0)
# Create a class separation in the swiss roll
y = np.where((univar_pos < 6) | (univar_pos > 13), 1, 0)
It looks as follows:
fig = plt.figure(figsize=(8, 6))
ax = fig.add_subplot(111, projection="3d")
fig.add_axes(ax)
ax.scatter(X[:, 0], X[:, 1], X[:, 2], c=y, s=50, alpha=0.8)
ax.set_title("Binary swiss roll")
ax.view_init(azim=-66, elev=12)
_ = ax.text2D(0.8, 0.05, s="n_samples=1500", transform=ax.transAxes)
Adapting the code I shared earlier:
from sklearn.mixture import GaussianMixture
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.base import BaseEstimator, ClassifierMixin
class GMMClassifier(BaseEstimator, ClassifierMixin):
    def __init__(self, n_components_class0=1, n_components_class1=1, covariance_type='full'):
        self.n_components_class0 = n_components_class0
        self.n_components_class1 = n_components_class1
        self.covariance_type = covariance_type

    def fit(self, X, y):
        # Fit one mixture per class, each with its own number of components
        X_class0 = X[y == 0]
        X_class1 = X[y == 1]
        self.gmm_class0 = GaussianMixture(n_components=self.n_components_class0,
                                          covariance_type=self.covariance_type,
                                          random_state=0)
        self.gmm_class1 = GaussianMixture(n_components=self.n_components_class1,
                                          covariance_type=self.covariance_type,
                                          random_state=0)
        self.gmm_class0 = self.gmm_class0.fit(X_class0)
        self.gmm_class1 = self.gmm_class1.fit(X_class1)
        return self  # scikit-learn convention: fit returns the estimator

    def predict(self, X):
        # score_samples returns per-sample log-likelihoods under each class's mixture
        prob_class0 = self.gmm_class0.score_samples(X)
        prob_class1 = self.gmm_class1.score_samples(X)
        return (prob_class1 > prob_class0).astype(int)
# Define the parameter grid for grid search
param_grid = {
    'n_components_class0': range(1, 11),
    'n_components_class1': range(1, 11),
}
# Create the GMM classifier
gmm_classifier = GMMClassifier()
# Create the pipeline for grid search
cv = StratifiedKFold(n_splits=5, random_state=0, shuffle=True)
grid_search = GridSearchCV(gmm_classifier, param_grid, cv=cv)
# Fit the model to the data
grid_search.fit(X, y) |
Currently, GMMClassifier always uses the same number of components to fit a model on each class.
The number of components providing the best "fit" for the data is rarely the same across different classes. In a binary classification task, I would be interested in tuning n_components using gridsearch for each class independently.
I understand that this would be difficult to implement, as n_components is set when initializing GMMClassifier. Is this something that would be interesting to pursue?
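As a rough illustration of what the request could look like, here is a minimal sketch in which `n_components` accepts either a single int or a dict mapping each class label to its own count. This is not the sklego implementation: the class name `PerClassGMMClassifier` and the dict-based parameter are hypothetical, and the class-prior term in `predict` is an assumption of this sketch.

```python
import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture


class PerClassGMMClassifier(ClassifierMixin, BaseEstimator):
    """Hypothetical sketch: one GaussianMixture per class, each with its own size."""

    def __init__(self, n_components=1, covariance_type="full", random_state=None):
        # n_components: an int shared by all classes, or a dict {label: int}
        self.n_components = n_components
        self.covariance_type = covariance_type
        self.random_state = random_state

    def fit(self, X, y):
        self.classes_ = np.unique(y)
        self.gmms_ = {}
        self.log_priors_ = {}
        for label in self.classes_:
            if isinstance(self.n_components, dict):
                k = self.n_components[label]
            else:
                k = self.n_components
            X_label = X[y == label]
            self.gmms_[label] = GaussianMixture(
                n_components=k,
                covariance_type=self.covariance_type,
                random_state=self.random_state,
            ).fit(X_label)
            # Assumption of this sketch: weight each class by its training frequency
            self.log_priors_[label] = np.log(len(X_label) / len(X))
        return self

    def predict(self, X):
        # log p(x | class) + log p(class), one column per class
        scores = np.column_stack([
            self.gmms_[label].score_samples(X) + self.log_priors_[label]
            for label in self.classes_
        ])
        return self.classes_[np.argmax(scores, axis=1)]


# Usage on artificial data: one Gaussian for class 0, two for class 1
X, y = make_blobs(n_samples=2000, cluster_std=[0.8, 2, 1], random_state=0)
y = np.where(y == 2, 1, y)
clf = PerClassGMMClassifier(n_components={0: 1, 1: 2}, random_state=0).fit(X, y)
print("training accuracy:", clf.score(X, y))
```

Accepting an int keeps the default behavior unchanged, while the dict form allows per-class tuning, for instance by listing several dicts as candidate values in a grid search.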