Unavoidable "y_true and y_pred contain different number of classes" error inside a CV loop #11777

gwerbin opened this issue Aug 7, 2018 · 15 comments

gwerbin commented Aug 7, 2018

Description

During cross-validation on a multi-class problem, it's possible for classes that are present in the test fold to be absent from the training fold, which makes probability-based metrics such as log_loss fail with a class-mismatch error.

Steps/Code to Reproduce

import numpy as np
from sklearn.metrics import make_scorer, log_loss
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold
from sklearn.naive_bayes import BernoulliNB

rs = np.random.RandomState(1389057)

y = [
    'cow',
    'hedgehog',
    'fox',
    'fox',
    'hedgehog',
    'fox',
    'hedgehog',
    'cow',
    'cow',
    'fox'
]

x = rs.normal([0, 0], [1, 1], size=(len(y), 2))

model = BernoulliNB()

cv = StratifiedKFold(4, shuffle=True, random_state=rs)

param_dist = {
    'alpha': np.logspace(np.log10(0.1), np.log10(1), 20)
}

search = RandomizedSearchCV(model, param_dist, 5,
                            scoring=make_scorer(log_loss, needs_proba=True), cv=cv)

search.fit(x, y)

Expected Results

Either:

  1. Predicted classes from predict_proba are aligned with classes in the full training data, not just the in-fold subset.
  2. Classes not in the training data are ignored in the test data.

Actual Results

Predicted classes from predict_proba are aligned with classes in the in-fold subset only, but classes not in the training data are still used in the test data, causing the error.

I understand that this is normatively "correct" behavior, but it makes log_loss hard or impossible to use in cross-validation with the existing APIs.

From my perspective, the best solution would be to have RandomizedSearchCV pass a labels=self.classes_ argument to its scorer. I'm not sure how well that generalizes.

Versions

Linux-3.10.0-514.26.2.el7.x86_64-x86_64-with-redhat-7.3-Maipo
Python 3.6.6 |Anaconda, Inc.| (default, Jun 28 2018, 17:14:51) [GCC 7.2.0]
NumPy 1.15.0
SciPy 1.1.0
Scikit-Learn 0.19.1

@NicolasHug (Member)

What about scoring=make_scorer(log_loss, needs_proba=True, labels=y)?

gwerbin commented Aug 7, 2018

@NicolasHug that only fixes the problem half-way.

Setting labels= resolves the "not enough classes out-of-fold" case. But the "not enough classes in-fold" case is still a problem.

import numpy as np
from sklearn.metrics import make_scorer, log_loss
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold
from sklearn.naive_bayes import BernoulliNB

rs = np.random.RandomState(99984)

y = [
    'cow',
    'hedgehog',
    'fox',
    'fox',
    'hedgehog',
    'fox',
    'hedgehog',
    'fox'
]

x = rs.normal([0, 0], [1, 1], size=(len(y), 2))

model = BernoulliNB()

cv = StratifiedKFold(4, shuffle=True, random_state=rs)

param_dist = {
    'alpha': np.logspace(np.log10(0.1), np.log10(1), 20)
}

search = RandomizedSearchCV(model, param_dist, 5,
                            scoring=make_scorer(log_loss, needs_proba=True, labels=y), cv=cv)

search.fit(x, y)

NicolasHug commented Aug 7, 2018

The warning I'm getting is:

sklearn/model_selection/_split.py:605: Warning: The least populated class in y has only 3 members, which is too few. The minimum number of members in any class cannot be less than n_splits=4

which makes sense. You cannot have a stratified CV if the number of folds is greater than the number of instances from a given class: that would mean one or more of the folds would have no instance of that class.

Setting n_splits=3 removes the warning.

gwerbin commented Aug 7, 2018

@NicolasHug did you run the 2nd example? It should throw an error.

ValueError: The number of classes in labels is different from that in y_pred. Classes found in labels: ['cow' 'fox' 'hedgehog']

@NicolasHug (Member)

Sorry, I didn't double-check the 2nd one; I do get the same error. I'm not exactly sure how this could be elegantly fixed :/

gwerbin commented Aug 7, 2018

This is also an issue for binarized multi-class problems, so it's worth coming up with at least some kind of solution.

One very general fix would be to allow the .fit methods of classifiers to accept a labels= argument, like many loss functions (e.g. log_loss) already do. Then the estimator class can handle the index bookkeeping and correctly align the output. As for the ghost columns created by alignment, filling them with 0 is reasonable for most combinations of loss function and estimator.

Another fix would be to pass extra information (such as the full label set) into the scorer, and have the scorer handle the index bookkeeping. That might be a lower-impact solution, since it would be internal and isolated to RandomizedSearchCV, GridSearchCV, and make_scorer.
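
For illustration, a rough sketch of that scorer-side bookkeeping as a plain callable scorer (the helper name is made up, and the full label set is assumed to be known up front):

import numpy as np
from sklearn.metrics import log_loss

def make_aligned_log_loss_scorer(all_labels):
    # Hypothetical helper: build a scorer that realigns the in-fold
    # predict_proba columns to the full label set before scoring.
    all_classes = np.unique(all_labels)

    def scorer(estimator, X, y):
        proba = estimator.predict_proba(X)
        aligned = np.zeros((proba.shape[0], len(all_classes)))
        # Columns of `proba` correspond to estimator.classes_ (the classes
        # seen in this fold); drop them into the matching full-width slots.
        aligned[:, np.searchsorted(all_classes, estimator.classes_)] = proba
        # Guard against log(0) for test samples whose class was unseen
        # in-fold (0.19 clips internally, but newer releases may not).
        aligned = np.clip(aligned, 1e-15, None)
        aligned /= aligned.sum(axis=1, keepdims=True)
        # Scorers follow "greater is better", so return the negated loss.
        return -log_loss(y, aligned, labels=all_classes)

    return scorer

It could then be passed as scoring=make_aligned_log_loss_scorer(y) in the searches above.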

jnothman commented Aug 8, 2018

See #6231, #8100, #9585. It's a problem whose solution we are stalled on.... :(

gwerbin commented Aug 8, 2018

Here's a proof-of-concept for label alignment in BernoulliNB: https://gist.github.com/gwerbin/8a8f777db6775c7d0c3e585c39c550f8 (including a variation of the 2nd example I posted above).

You can see that the logic for actually manipulating the output labels is almost totally agnostic of the nature of the classifier. You could probably have each predict_* method call some kind of generic _align_labels method on the prediction output.

Edit: note that I didn't touch the class priors. You would probably have to handle that more carefully on a class-by-class basis, e.g. check that they align with the given classes.

@jnothman reading over those issues, it seems like that whole class of related problems could be made easier by enforcing more structure on target variable handling.

For example, BernoulliNB.fit calls LabelBinarizer internally:

labelbin = LabelBinarizer()
Y = labelbin.fit_transform(y)
self.classes_ = labelbin.classes_
if Y.shape[1] == 1:
    Y = np.concatenate((1 - Y, Y), axis=1)

https://github.com/scikit-learn/scikit-learn/blob/0.19.1/sklearn/naive_bayes.py#L582-L586

To me, this violates the "separation of concerns" principle. Scikit-learn already has a well-defined taxonomy of target variables, yet logic related to target types is scattered throughout the codebase, as in the example above. Each class is burdened with implementing its own logic to handle the various target types, and whoever writes the documentation is burdened in turn with documenting that logic.

Imagine if you could do something like this:

class BernoulliNB(BaseEstimator, ClassifierMixin):
    _TARGET_TYPES_ = ('binary', 'multiclass', 'multilabel-indicator')

and have logic inherited from ClassifierMixin just "do the right thing" with respect to converting between target types. You could then use this attribute to help enforce consistent output-alignment handling.
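
For illustration only (nothing like this exists in scikit-learn today), the mixin side of that could be sketched roughly as follows, reusing type_of_target, the existing target-taxonomy helper in sklearn.utils.multiclass:

from sklearn.utils.multiclass import type_of_target

class TargetTypeMixin:
    # Hypothetical sketch: estimators declare the target types they support.
    _TARGET_TYPES_ = ()

    def _check_target_type(self, y):
        # type_of_target already implements the target taxonomy.
        target_type = type_of_target(y)
        if target_type not in self._TARGET_TYPES_:
            raise ValueError(
                "%s does not support %r targets (supported: %s)"
                % (type(self).__name__, target_type,
                   ", ".join(self._TARGET_TYPES_)))
        return target_type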

gwerbin commented Aug 25, 2018

I'm willing to put in the work to implement and test a generic wrapper for label alignment for the multiclass, multiclass-multioutput, and multilabel-indicator cases, but I'm not familiar enough with scikit-learn internals to know when and where to inject it.

Is there interest in pursuing this further? Should I follow up on the mailing list instead of here?

jnothman commented Aug 30, 2018 via email

gwerbin commented Aug 30, 2018

Thanks @jnothman. What would be an example where multiclass can't be handled with LabelBinarizer, or LabelEncoder -> LabelBinarizer?

Another quick-fix solution would be to add an option like LabelEncoder(..., fill_missing=-1) and LabelBinarizer(..., missing_indicator=-1), which would result in something like this:

le = LabelEncoder(fill_missing=-1)
lb = LabelBinarizer(missing_indicator=-1)

le.fit(['a', 'b', 'c'])
lb.fit(le.transform(['a', 'b', 'c']))

tmp1 = le.transform(['a', 'b', 'c', 'd'])
tmp2 = lb.transform(tmp1)

print(tmp1)
# array([ 0,  1,  2, -1])

print(tmp2)
# array([[1, 0, 0],
#        [0, 1, 0],
#        [0, 0, 1],
#        [0, 0, 0]])
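
Something close to this can already be pieced together by hand; a sketch (fill_missing and missing_indicator above are hypothetical options, and encode_with_missing below is a made-up helper, but label_binarize already maps codes outside classes to all-zero rows):

import numpy as np
from sklearn.preprocessing import LabelEncoder, label_binarize

le = LabelEncoder().fit(['a', 'b', 'c'])

def encode_with_missing(le, values, fill=-1):
    # Hypothetical helper mimicking fill_missing: unseen labels map to
    # `fill` instead of raising a ValueError.
    values = np.asarray(values)
    known = np.isin(values, le.classes_)
    codes = np.full(values.shape, fill, dtype=int)
    codes[known] = le.transform(values[known])
    return codes

codes = encode_with_missing(le, ['a', 'b', 'c', 'd'])
# array([ 0,  1,  2, -1])

onehot = label_binarize(codes, classes=np.arange(len(le.classes_)))
# the unseen 'd' becomes an all-zero row, like missing_indicator=-1 above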

jnothman commented Aug 31, 2018 via email

gwerbin commented Aug 31, 2018

@jnothman in which case you can still use LabelEncoder, right?

qinj commented Nov 8, 2019

Use the model.classes_ attribute:

y_probs = model.predict_proba(X_test)
log_loss(y_test, y_probs, labels=model.classes_)
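
Wrapped as a plain callable scorer, the same idea can be dropped into the CV loop (a sketch; test samples whose class the fitted model never saw in-fold are still not handled, which is the limitation discussed above):

from sklearn.metrics import log_loss

def neg_log_loss_scorer(estimator, X_test, y_test):
    # Score against the classes the fitted estimator actually knows about.
    y_probs = estimator.predict_proba(X_test)
    return -log_loss(y_test, y_probs, labels=estimator.classes_)

# e.g. RandomizedSearchCV(model, param_dist, scoring=neg_log_loss_scorer, cv=cv)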

turian commented Jul 28, 2021

I have this error too
