ValueError: y_true and y_pred contain different number of classes 2, 3. #1220
Comments
@PGijsbers Thank you for submitting this issue, for the detail, and for the minimally reproducible example. It seems that this is an issue when a class is unobserved in any of the cross-validation folds that TPOT generated (by default, it uses 5-fold cross-validation). You can reduce the number of folds performed by TPOT so that it is less than the number of instances of the smallest class, or create your own cross-fold generator that ensures at least one of each class exists in the data passed when fitting the pipeline for scoring (both of these use the `cv` parameter).

This is an issue that occurs in native sklearn and is due to how `log_loss` checks the classes observed in `y_true` against the columns of `y_pred`.

In theory, we could eliminate/ignore sparsely-populated classes either in preprocessing or when evaluating pipelines, but as TPOT can otherwise handle cases like this and properly construct and mutate pipelines with most other metrics (for example, if you use the basic accuracy metric), this doesn't seem like the best approach to take without user input. It may be something better left to the user to do, as the approach to removing outliers or handling classes with few instances will likely differ significantly based on the meta-features of the input dataset.

It is possible to handle this and use a larger number of folds, without modifying the functionality of TPOT or sklearn, and maintain the use of the log loss metric by padding the predicted probabilities in a custom scorer:

```python
from tpot import TPOTClassifier
import numpy as np
from sklearn.metrics import log_loss, make_scorer
x, y = np.random.random((151, 4)), np.asarray([0] * 75 + [1] * 75 + [2])
labels = np.unique(y)
def mod_log_loss(y_true, y_pred, labels):
    # pad y_pred with zero-probability columns for any classes it is missing
    class_diff = len(labels) - len(y_pred[0])
    if class_diff > 0:
        y_pred_pad = np.array([np.pad(p, pad_width=(0, class_diff)) for p in y_pred])
    else:
        y_pred_pad = y_pred
    return log_loss(y_true, y_pred_pad, labels=labels)

mod_neg_log_loss = make_scorer(mod_log_loss, greater_is_better=False, labels=labels, needs_proba=True)
t = TPOTClassifier(max_time_mins=1, scoring=mod_neg_log_loss)
t.fit(x, y)
t.predict(x)
```

Note that this demo assumes that the missing classes are the last classes (as it pads at the end of the probability vectors). In theory, you could instead determine which classes are present in the predictions and insert the padding at the matching positions (see the sketch below). Let us know if you have any thoughts or questions!
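A minimal sketch of that position-aware variant follows. It is illustrative, not tested against TPOT: the helper name is hypothetical, it assumes TPOT's `scoring` parameter accepts an sklearn-style `scorer(estimator, X, y)` callable, and it assumes the label array is sorted (as returned by `np.unique`):

```python
import numpy as np
from sklearn.metrics import log_loss

def make_full_class_log_loss(all_labels):
    # all_labels must be sorted, e.g. np.unique(y) over the full dataset
    all_labels = np.asarray(all_labels)

    def scorer(estimator, X, y_true):
        proba = estimator.predict_proba(X)
        full = np.zeros((len(X), len(all_labels)))
        # place each predicted column at the position of its class label,
        # leaving zero probability for classes the estimator never saw
        for j, cls in enumerate(estimator.classes_):
            full[:, np.searchsorted(all_labels, cls)] = proba[:, j]
        return -log_loss(y_true, full, labels=all_labels)  # negate: greater is better

    return scorer
```

For the fold-based workarounds mentioned at the top of this comment, a custom cross-fold generator could look like the following sketch (again hypothetical, assuming TPOT forwards `cv` to sklearn's cross-validation utilities, which accept an iterable of (train, test) index pairs, and assuming integer class labels):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

def splits_with_all_classes(X, y, n_splits=5, seed=0):
    y = np.asarray(y)
    # classes with fewer instances than folds cannot be stratified
    rare_classes = np.flatnonzero(np.bincount(y) < n_splits)
    rare_idx = np.flatnonzero(np.isin(y, rare_classes))
    common_idx = np.flatnonzero(~np.isin(y, rare_classes))
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for train, test in skf.split(X[common_idx], y[common_idx]):
        # rare samples are pinned to the training side of every split,
        # so each fitted pipeline observes every class at least once
        yield np.concatenate([common_idx[train], rare_idx]), common_idx[test]

# e.g. TPOTClassifier(max_time_mins=1, scoring="neg_log_loss",
#                     cv=list(splits_with_all_classes(x, y)))
```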
Thank you very much for the elaborate response. I was aware of the underlying issue, but I wasn't aware it was a design decision not to address it within TPOT. I understand the decision; feel free to close the issue if desired.
Admittedly, I'm not sure whether we should handle this case ourselves or rely on the user to know the drawbacks of imbalanced data and/or the limits of the metrics they choose. My logic is that processing the data in any way that isn't fully transparent to the user and/or consistent across all cases will be problematic, and that it's better to leave it up to the user how they want to handle the situation, since there are many options and the best one will likely depend on what the user knows about their data and the importance of the outlier class. For example, in biomedical data, imbalances are common but usually highly important, as in cases where you have extraordinarily rare diseases with few cases against a large number of "control" cases.

That being said, I'll have to talk with the rest of the lab that supports TPOT to see what the best choice might be. Thank you for raising the issue! We'll keep it open for now while we think about the best way to handle this - we may need to be clearer about this in the documentation or keep it in mind for future TPOT extensions/modifications.
Yes, I think it depends entirely on how hands-off you want the AutoML experience to be and what the expected data science experience of the user is.
A `TPOT.fit` call may fail when there are outlier minority classes (with certain metrics).

Context of the issue
When running the benchmark we encountered this issue sometimes, for instance with evaluations on `wine-quality-white`: `python runbenchmark.py TPOT openml/t/359974 1h8c -f 6`. Because of TPOT internals, the small minority classes may cause an error when optimizing towards log loss. I reduced the issue to a minimal example:
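A minimal sketch of such an example, assuming the synthetic data from the workaround comment above and the built-in `neg_log_loss` scorer:

```python
from tpot import TPOTClassifier
import numpy as np

# one class with a single instance + log-loss scoring triggers the error
x, y = np.random.random((151, 4)), np.asarray([0] * 75 + [1] * 75 + [2])
t = TPOTClassifier(max_time_mins=1, scoring="neg_log_loss")
t.fit(x, y)  # fails during internal cross-validated scoring
```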
Expected result

I expect a pipeline to be fit regardless, and to be able to produce predictions for every class (even if that means with a probability of zero and receiving a warning about it).
Current result
Running the MWE fails during pipeline evaluation with:

ValueError: y_true and y_pred contain different number of classes 2, 3
Possible fix
Depends on the level you want to fix it on; options include a fix in scikit-learn (… warnings, and also lead to the error).