Description
During cross-validation on a multi-class problem, it's technically possible for classes that appear in the test split to be absent from the training split.
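Whether a given split exhibits this can be checked directly. A minimal sketch (the class counts mirror the reproduction below; the seed is arbitrary, so the message may or may not fire for any particular split):

import numpy as np
from sklearn.model_selection import StratifiedKFold

# Three 'cow', three 'hedgehog', four 'fox' samples: the two smallest
# classes have fewer members than the number of folds.
y = np.array(['cow'] * 3 + ['hedgehog'] * 3 + ['fox'] * 4)
cv = StratifiedKFold(4, shuffle=True, random_state=0)
for train_idx, test_idx in cv.split(np.zeros((len(y), 1)), y):
    unseen = set(y[test_idx]) - set(y[train_idx])
    if unseen:
        print('test split contains classes unseen in training:', unseen)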
Steps/Code to Reproduce
import numpy as np
from sklearn.metrics import make_scorer, log_loss
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold
from sklearn.naive_bayes import BernoulliNB

rs = np.random.RandomState(1389057)

# Ten samples over three classes; 'cow' and 'hedgehog' each have only
# three members, fewer than the four folds used below.
y = [
    'cow',
    'hedgehog',
    'fox',
    'fox',
    'hedgehog',
    'fox',
    'hedgehog',
    'cow',
    'cow',
    'fox'
]
x = rs.normal([0, 0], [1, 1], size=(len(y), 2))

model = BernoulliNB()
cv = StratifiedKFold(4, shuffle=True, random_state=rs)

# 20 values of alpha, log-spaced between 0.1 and 1
# (np.log10, not np.log: np.logspace takes base-10 exponents).
param_dist = {
    'alpha': np.logspace(np.log10(0.1), np.log10(1), 20)
}
search = RandomizedSearchCV(model, param_dist, n_iter=5,
                            scoring=make_scorer(log_loss, needs_proba=True),
                            cv=cv)
search.fit(x, y)
Expected Results
Either:
- Predicted classes from predict_proba are aligned with the classes in the full training data, not just the in-fold subset.
- Classes not in the training data are ignored in the test data.
Actual Results
Predicted classes from predict_proba are aligned with the classes in the in-fold training subset only, while the test split still contains the unseen classes, so log_loss raises a ValueError about y_true and y_pred containing different numbers of classes.
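The failure can be reduced to the metric call itself. A minimal sketch with made-up probabilities, assuming a model fit on a fold that contained only 'cow' and 'fox':

import numpy as np
from sklearn.metrics import log_loss

# predict_proba yields two columns (one per in-fold class), but the
# test split holds three classes.
y_test = ['cow', 'fox', 'hedgehog']
proba = np.array([[0.7, 0.3],
                  [0.2, 0.8],
                  [0.5, 0.5]])
log_loss(y_test, proba)  # ValueError: y_true and y_pred contain different number of classes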
I understand that this is normatively "correct" behavior, but it makes log_loss hard or impossible to use in cross-validation with the existing APIs.
From my perspective, the best solution would be for RandomizedSearchCV to pass a labels=self.classes_ argument to its scorer, though I'm not sure how well that generalizes.
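Until something along those lines exists, here is a workaround sketch (my own, not a library API): a custom scorer with the (estimator, X, y) signature that pads predict_proba with zero columns for classes the in-fold estimator never saw, then scores against the full label set via log_loss's labels argument. ALL_CLASSES is an assumption standing in for the label set of the full data.

import numpy as np
from sklearn.metrics import log_loss

ALL_CLASSES = np.array(['cow', 'fox', 'hedgehog'])  # assumed full, sorted label set

def padded_neg_log_loss(estimator, X, y):
    # Probabilities only cover the classes seen in the training split.
    proba = estimator.predict_proba(X)
    # Pad with zero-probability columns for the unseen classes.
    padded = np.zeros((proba.shape[0], len(ALL_CLASSES)))
    for j, cls in enumerate(estimator.classes_):
        padded[:, np.searchsorted(ALL_CLASSES, cls)] = proba[:, j]
    # Scorers follow "greater is better", so return the negated loss.
    return -log_loss(y, padded, labels=ALL_CLASSES)

This can be passed directly as scoring=padded_neg_log_loss to RandomizedSearchCV. Since log_loss clips probabilities internally, the zero columns yield a large but finite penalty whenever an unseen class actually occurs in the test split.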
Versions
Linux-3.10.0-514.26.2.el7.x86_64-x86_64-with-redhat-7.3-Maipo
Python 3.6.6 |Anaconda, Inc.| (default, Jun 28 2018, 17:14:51) [GCC 7.2.0]
NumPy 1.15.0
SciPy 1.1.0
Scikit-Learn 0.19.1