Unavoidable "y_true and y_pred contain different number of classes" error inside a CV loop #11777
What about
@NicolasHug that only fixes the problem half-way. Setting labels in the scorer, as in this second example, still fails:

import numpy as np
from sklearn.metrics import make_scorer, log_loss
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold
from sklearn.naive_bayes import GaussianNB

rs = np.random.RandomState(99984)
y = [
    'cow',
    'hedgehog',
    'fox',
    'fox',
    'hedgehog',
    'fox',
    'hedgehog',
    'fox',
]
x = rs.normal([0, 0], [1, 1], size=(len(y), 2))

model = GaussianNB()
cv = StratifiedKFold(4, shuffle=True, random_state=rs)
param_dist = {
    'alpha': np.logspace(np.log(0.1), np.log(1), 20),
}
search = RandomizedSearchCV(model, param_dist, n_iter=5,
                            scoring=make_scorer(log_loss, needs_proba=True, labels=y),
                            cv=cv)
search.fit(x, y)
The warning I'm getting is the one about the least populated class in y having fewer members than n_splits, which makes sense: you cannot have a stratified CV when the number of folds is greater than the number of instances of a given class, because one or more of the folds would then have no instance of that class.
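As a side illustration of that constraint (a small check of my own, not part of the original comment), counting class members in the example's y shows why a 4-fold stratified split cannot work here:

# StratifiedKFold needs every class to have at least n_splits members,
# otherwise some fold would contain no instance of that class.
from collections import Counter

y = ['cow', 'hedgehog', 'fox', 'fox', 'hedgehog', 'fox', 'hedgehog', 'fox']
n_splits = 4
counts = Counter(y)
print(counts)  # Counter({'fox': 4, 'hedgehog': 3, 'cow': 1})
print([cls for cls, n in counts.items() if n < n_splits])  # ['cow', 'hedgehog']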
@NicolasHug did you run the 2nd example? It should throw an error.
Sorry, I didn't double-check the 2nd one; I get the same error indeed. I'm not exactly sure how this could be elegantly fixed :/
This is also an issue for binarized multi-class problems, so it's worth coming up with at least some kind of solution. One very general fix would be to allow the full set of classes to be passed through. Another fix would be to add some kind of additional logic that can be fed into the scorer, and then have the scorer handle the index bookkeeping. That might be a lower-impact solution, since it would be internal and isolated to just the scoring code.
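A minimal sketch of that scorer-side bookkeeping, assuming the full label set is known up front (the function name and the label list here are illustrative, not from the thread): a callable scorer receives the fitted estimator, so it can pad the fold's predict_proba columns out to every class before computing the log loss.

import numpy as np
from sklearn.metrics import log_loss

ALL_LABELS = np.array(['cow', 'fox', 'hedgehog'])  # full, sorted label set known in advance

def aligned_log_loss(estimator, X, y_true):
    # Columns of predict_proba follow estimator.classes_ (the in-fold classes only).
    proba = estimator.predict_proba(X)
    padded = np.zeros((proba.shape[0], len(ALL_LABELS)))
    for col, cls in enumerate(estimator.classes_):
        padded[:, np.searchsorted(ALL_LABELS, cls)] = proba[:, col]
    # Scorers follow the "greater is better" convention, hence the negation.
    return -log_loss(y_true, padded, labels=ALL_LABELS)

Passing scoring=aligned_log_loss to RandomizedSearchCV would use this callable directly, instead of going through make_scorer.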
Here's a proof-of-concept for label alignment. You can see that the logic for actually manipulating the output labels is almost totally agnostic of the nature of the classifier, so you could probably have each classifier share it.

Edit: note that I didn't touch the class priors. You would probably have to handle those more carefully on a class-by-class basis, e.g. check that they align with the given classes.

@jnothman reading over those issues, it seems like that whole class of related problems could be made easier by enforcing more structure on target variable handling. For example, this pattern appears inside an estimator's fit:

labelbin = LabelBinarizer()
Y = labelbin.fit_transform(y)
self.classes_ = labelbin.classes_
if Y.shape[1] == 1:
    Y = np.concatenate((1 - Y, Y), axis=1)

To me, this violates the "separation of concerns" principle. Scikit-learn already has a well-defined taxonomy of target variables, yet logic related to target types is scattered throughout the codebase, like in the example above. Each class is burdened with implementing its own logic to handle the various target types, and whoever writes the documentation is burdened in turn with documenting that logic. Imagine if you could do something like this:

class BernoulliNB(BaseEstimator, ClassifierMixin):
    _TARGET_TYPES_ = ('binary', 'multiclass', 'multilabel-indicator')

and have logic inherited from ClassifierMixin just "do the right thing" with respect to converting between target types. You could then use this attribute to help enforce consistent output-alignment handling.
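A minimal sketch of how that declarative idea might look (the mixin and method names are hypothetical, not part of scikit-learn); it only covers validating the declared target types, not any conversion between them:

from sklearn.utils.multiclass import type_of_target

class TargetTypeMixin:
    # Subclasses declare which target types they support.
    _TARGET_TYPES_ = ()

    def _validate_target(self, y):
        kind = type_of_target(y)
        if kind not in self._TARGET_TYPES_:
            raise ValueError("%s supports target types %s, got %r"
                             % (type(self).__name__, self._TARGET_TYPES_, kind))
        return kind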
I'm willing to put in the work to implement and test a generic wrapper for label alignment for the multiclass, multiclass-multioutput, and multilabel-indicator cases, but I'm not familiar enough with Scikit-learn to know when and where to inject it. Is there interest in pursuing this further? Should I follow up on the mailing list instead of here?
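A rough sketch of what such a wrapper could look like for the plain multiclass case (the class name and API are illustrative, not an existing scikit-learn interface): it is told the full class set up front and pads predict_proba with zero-probability columns for any class missing from the training fold.

import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin, clone

class KnownClassesClassifier(BaseEstimator, ClassifierMixin):
    # Hypothetical wrapper: align probability outputs to a fixed class set.
    def __init__(self, estimator, classes):
        self.estimator = estimator
        self.classes = classes

    def fit(self, X, y):
        self.estimator_ = clone(self.estimator).fit(X, y)
        self.classes_ = np.asarray(self.classes)
        return self

    def predict_proba(self, X):
        proba = self.estimator_.predict_proba(X)
        out = np.zeros((proba.shape[0], len(self.classes_)))
        for col, cls in enumerate(self.estimator_.classes_):
            out[:, np.flatnonzero(self.classes_ == cls)[0]] = proba[:, col]
        return out

    def predict(self, X):
        return self.classes_[self.predict_proba(X).argmax(axis=1)]

The multilabel-indicator and multiclass-multioutput cases would need their own column bookkeeping, which is presumably where most of the work lies.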
I think you need to wait a few weeks. Core devs are either vacationing or exhausted from trying to release version 0.20. Then working towards a tangible proof of concept would be interesting. But I promise a long road before we achieve consistency and consensus. Very often multiclass cannot just be handled with LabelBinarizer. And yes, you might get different people responding if you advertise on the mailing list.
Thanks @jnothman. What would be an example where multiclass can't be handled with LabelBinarizer, or LabelEncoder -> LabelBinarizer? Another quick-fix solution would be to add an option like this:

le = LabelEncoder(fill_missing=-1)         # fill_missing and missing_indicator are proposed
lb = LabelBinarizer(missing_indicator=-1)  # options, not existing parameters
le.fit(['a', 'b', 'c'])
lb.fit(le.transform(['a', 'b', 'c']))
tmp1 = le.transform(['a', 'b', 'c', 'd'])
tmp2 = lb.transform(tmp1)
print(tmp1)
# array([ 0,  1,  2, -1])
print(tmp2)
# array([[1, 0, 0],
#        [0, 1, 0],
#        [0, 0, 1],
#        [0, 0, 0]])
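For comparison, something close to that behaviour is possible with the current API, though only by hand (a small sketch of my own, not from the thread): map labels unseen at fit time to -1 yourself, then binarize against the known codes.

import numpy as np
from sklearn.preprocessing import LabelEncoder, label_binarize

le = LabelEncoder().fit(['a', 'b', 'c'])
new = np.array(['a', 'b', 'c', 'd'])

# Encode the known labels and mark anything unseen at fit time with -1.
codes = np.full(len(new), -1)
known = np.isin(new, le.classes_)
codes[known] = le.transform(new[known])
print(codes)  # [ 0  1  2 -1]

# Values outside `classes` get an all-zero row, so the unseen 'd' binarizes to zeros.
print(label_binarize(codes, classes=[0, 1, 2]))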
Trees, for instance.
@jnothman in which case you can still use LabelEncoder, right?
Use the model.classes_ attribute: the columns of y_probs = model.predict_proba(X_test) are ordered to match it.
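For instance (a self-contained toy example, not from the thread), fitting on a fold that only contains two of the classes and then pairing each probability column with classes_:

import numpy as np
from sklearn.naive_bayes import GaussianNB

# Toy training fold in which only 'fox' and 'hedgehog' appear.
X_train = np.array([[0.0], [1.0], [2.0], [3.0]])
y_train = ['fox', 'hedgehog', 'fox', 'hedgehog']
model = GaussianNB().fit(X_train, y_train)

X_test = np.array([[0.5], [2.5]])
y_probs = model.predict_proba(X_test)        # one column per entry of model.classes_
for cls, p in zip(model.classes_, y_probs[0]):
    print(cls, p)                            # class label and its probability for the first sample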
I have this error too.
Description
During cross-validation on a multi-class problem, it's technically possible to have classes present in the test data that don't appear in the training data.
Steps/Code to Reproduce
Expected Results
Either:
- Predicted classes from predict_proba are aligned with the classes in the full training data, not just the in-fold subset.
Actual Results
Predicted classes from predict_proba are aligned with the classes in the in-fold subset only, but classes not in the training data still appear in the test data, causing the error.
I understand that this is normatively "correct" behavior, but it makes it hard or impossible to use in cross-validation with the existing APIs.
From my perspective, the best solution would be to have RandomizedSearchCV pass a labels=self.classes_ argument to its scorer. I'm not sure how well that generalizes.
Versions