ENH Hellinger distance split criterion for classification trees #16478
Conversation
@EvgeniDubov Thanks for porting it to scikit-learn. I think that you should merge master into your branch as well (regarding some CI configurations that are not used anymore). To give a bit of context on this PR, we discussed IRL with @ogrisel the integration of this feature into the trees. It would most probably only be supported for the binary case. The thing that we will need is to check that it helps in practice. I think one way would be to build trees using this criterion in the example that we are building in the
@glemaitre thanks for the fast response, I'm on it.
Can anyone add the
Hi everyone, I was just curious when you think this option will be finalised and added?
Are there any updates on when the Hellinger split criterion will be merged?
Thanks @EvgeniDubov for adding this feature!
Thank you for this contribution, @EvgeniDubov.
This is a notable feature.
Some comments:

- The test_importances parametrisation must be extended to include 'hellinger'.
- The example and the implementation of feature importance can come as a follow-up.

Let us know if you need more feedback and if you have time to pursue this work.
@EvgeniDubov: merging
Thanks for the PR @EvgeniDubov!
I left a couple of comments regarding the documentation.
I think it would be nice to add a reference to this new distance for unbalanced datasets in doc/modules/tree.rst.
For example, we can mention the Hellinger distance where it says:
Decision tree learners create biased trees if some classes dominate. It is therefore recommended to balance the dataset prior to fitting with the decision tree.
Likewise here:
Balance your dataset before training to prevent the tree from being biased toward the classes that are dominant. Class balancing can be done by sampling an equal number of samples from each class, or preferably by normalizing the sum of the sample weights (sample_weight) for each class to the same value. Also note that weight-based pre-pruning criteria, such as min_weight_fraction_leaf, will then be less biased toward dominant classes than criteria that are not aware of the sample weights, like min_samples_leaf.
Also, we should mention in doc/modules/tree.rst that the Hellinger distance is only implemented for the binary case.
def check_imbalanced_criterion(name, criterion):
    ForestClassifier = FOREST_CLASSIFIERS[name]

    clf = ForestClassifier(n_estimators=10, criterion=criterion, random_state=1)
    clf.fit(X_large_imbl, y_large_imbl)

    # score is a mean of minority class predict_proba
    score = clf.predict_proba(X_large_imbl)[:, 1].mean()

    assert (
        score > imbl_minority_class_ratio
    ), "Failed with imbalanced criterion %s, score = %f, minority class ratio = %f" % (
        criterion,
        score,
        imbl_minority_class_ratio,
    )


@pytest.mark.parametrize("name", FOREST_CLASSIFIERS)
@pytest.mark.parametrize("criterion", ["hellinger"])
def test_imbalanced_criterions(name, criterion):
    check_imbalanced_criterion(name, criterion)

@jjerphan, @glemaitre Do you know why we use this test pattern (one function with the logic and another parametrized that calls it)?
sklearn/ensemble/tests/test_forest.py has some tests like this.
I do not know and I would merge them together if it's used only once.
Edit: I guess it makes sense if the check is used in various tests, which is not the case here.
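For illustration, the merged form could look roughly like the sketch below, with the helper inlined into the parametrized test. The module-level names (FOREST_CLASSIFIERS, X_large_imbl, y_large_imbl, imbl_minority_class_ratio) are the fixtures from the diff above and are assumed to already exist in test_forest.py, so this only runs in that context.

import pytest

@pytest.mark.parametrize("name", FOREST_CLASSIFIERS)
@pytest.mark.parametrize("criterion", ["hellinger"])
def test_imbalanced_criterions(name, criterion):
    # Same logic as check_imbalanced_criterion above, inlined since it is used only once.
    ForestClassifier = FOREST_CLASSIFIERS[name]
    clf = ForestClassifier(n_estimators=10, criterion=criterion, random_state=1)
    clf.fit(X_large_imbl, y_large_imbl)

    # The mean predicted probability of the minority class should exceed its prior.
    score = clf.predict_proba(X_large_imbl)[:, 1].mean()
    assert score > imbl_minority_class_ratio, (
        f"Failed with imbalanced criterion {criterion}, score = {score}, "
        f"minority class ratio = {imbl_minority_class_ratio}"
    )
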
@glemaitre the Hellinger score is computed per split (it involves three populations: the parent node and both children) rather than per single node population as with gini and entropy, so adding it to the existing tree split design was a bit tricky.
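For illustration, here is a rough sketch of the binary Hellinger split value as formulated by Cieslak & Chawla; the standalone function and the example numbers are purely illustrative and not scikit-learn API. The criterion scores how differently the parent's positives and negatives are routed to the two children, so it is not a weighted sum of per-child impurities the way gini/entropy gains are.

import numpy as np

def hellinger_split_value(n_pos_left, n_neg_left, n_pos_parent, n_neg_parent):
    # Fractions of the parent's positive/negative samples routed to the left child.
    tpr_left = n_pos_left / n_pos_parent
    fpr_left = n_neg_left / n_neg_parent
    # The remainder goes to the right child.
    tpr_right = 1.0 - tpr_left
    fpr_right = 1.0 - fpr_left
    # Hellinger distance between the two class-conditional split distributions.
    return np.sqrt(
        (np.sqrt(tpr_left) - np.sqrt(fpr_left)) ** 2
        + (np.sqrt(tpr_right) - np.sqrt(fpr_right)) ** 2
    )

# A split that isolates minority samples scores high regardless of the class ratio.
print(hellinger_split_value(n_pos_left=9, n_neg_left=10, n_pos_parent=10, n_neg_parent=1000))
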
@EvgeniDubov Thanks for this PR.
@lorentzenchr Do you have any evidence that this would not be helpful for binary classification of imbalanced data? If you look through the history of this ticket you will see that this was originally proposed as an addition to imblearn about 6 years ago. However, imblearn was dependent on sklearn in a way that meant this needed to be added to sklearn. From what I can see there is a lot of interest in this functionality and there is no evidence that it will not perform as expected.
I'm afraid the logic is reversed. One has to prove the usefulness of an additional feature. IMHO, this was not much discussed in #9947.
It takes a lot for me to sacrifice calibration, and I have not seen or do not understand the great advantage of the Hellinger distance as a split criterion. I see that you invested a lot in this feature and that I come in at the last minute. Please bear with me as someone who never understood what the problem of "the imbalanced class problem" is.
@lorentzenchr I did not mean to claim that it should be added without evidence for it merely because there is no evidence against it. #9947 cites several papers that give evidence of its theoretical viability. @EvgeniDubov confirmed its usefulness here: https://github.com/EvgeniDubov/hellinger-distance-criterion. I used it for an insurance client and got better results; I would have deployed it in production if it had been in a major library like sklearn. As a DS consultant, one of the most common problems is imbalanced binary classification. I can go into the details of why it is needed, but the imbalance is an additional issue on top of the binary classification problem itself. In any case, I would have used it several times over the years since my original request. So my question was: given what I would consider ample evidence that it would be a useful addition to sklearn, is there evidence which pushes back against that? If there is only evidence for it and it is a smallish addition, then I do not see why it would not be wanted.
@lorentzenchr https://statistics.berkeley.edu/sites/default/files/tech-reports/666.pdf
https://machinelearningmastery.com/cost-sensitive-decision-trees-for-imbalanced-classification/
From your example, I get
It depends quite a bit on the random seeds. But the point is that the Hellinger criterion does not seem to have a positive effect on the AUC.

# Import the necessary modules and libraries
import numpy as np
from sklearn import datasets
from sklearn.metrics import roc_auc_score, brier_score_loss, log_loss, confusion_matrix
from sklearn.tree import DecisionTreeClassifier

# Create imbalanced dataset
minority_class_ratio = 0.001
n_classes = 2
X, y = datasets.make_classification(
    n_samples=1000 + 1000,
    n_features=2,
    n_informative=2,
    n_redundant=0,
    n_repeated=0,
    n_classes=n_classes,
    n_clusters_per_class=1,
    weights=[1 - minority_class_ratio],
    shuffle=False,
    random_state=0,
)
X_train, y_train = X[:1000], y[:1000]
X_test, y_test = X[1000:], y[1000:]

# Criteria to compare
criterions = ["gini", "entropy", "hellinger"]

for criterion in criterions:
    clf = DecisionTreeClassifier(criterion=criterion, random_state=33)
    clf.fit(X_train, y_train)
    print(f"{criterion=}")
    print(f"training set:")
    y_prob = clf.predict_proba(X_train)
    print(f"  log_loss={log_loss(y_train, y_prob)} auc={roc_auc_score(y_train, y_prob[:, 1])}")
    print(f"test set:")
    y_prob = clf.predict_proba(X_test)
    print(f"  log_loss={log_loss(y_test, y_prob)} auc={roc_auc_score(y_test, y_prob[:, 1])}")
    print("")

I also made the effort to analyze one of the datasets from the cited papers. I decided on the binary version of the covertype dataset, https://www.openml.org/search?type=data&status=active&id=293. The big problem with the default params of
Summary: I do not see the advantage of using the Hellinger distance as a split criterion. Using better hyperparams (

import numpy as np
from sklearn.datasets import fetch_openml
from sklearn.metrics import roc_auc_score, log_loss
from sklearn.model_selection import cross_validate, GridSearchCV
from sklearn.tree import DecisionTreeClassifier
# covertype.binary, see https://www.openml.org/search?type=data&status=active&id=293
X, y = fetch_openml(data_id=293, as_frame=False, parser="auto", return_X_y=True)
y = y.astype(int) # y is of string type / object
criterions = ["gini", "entropy", "hellinger"]
def mean_depth(trees):
    return np.mean([x.get_depth() for x in trees])

for criterion in criterions:
    clf = DecisionTreeClassifier(criterion=criterion, random_state=33)
    cv_res = cross_validate(
        clf, X, y, scoring=["neg_log_loss", "roc_auc"], cv=5, n_jobs=-1, return_estimator=True
    )
    print(f"{criterion=}")
    print(f"  Average AUC = {np.mean(cv_res['test_roc_auc'])}")
    print(f"  Average log_loss = {-np.mean(cv_res['test_neg_log_loss'])}")
    print(f"  Average tree depth = {mean_depth(cv_res['estimator'])}")

The average tree depth shows that gini and entropy produce much too deep trees and are very likely overfitting a lot. To come up with a fairer comparison, we do a grid search.

for criterion in criterions:
    clf = DecisionTreeClassifier(criterion=criterion, random_state=33)
    grid = GridSearchCV(
        clf,
        scoring="roc_auc",
        param_grid={
            "min_samples_leaf": [1, 2, 4, 6, 8, 10],
            "max_depth": [2, 3, 4, 5, 6],
        },
        cv=3,
        n_jobs=-1,
    ).fit(X, y)
    print(f"{criterion=}")
    print(f"  best AUC = {grid.best_score_}")
    print(f"  best params = {grid.best_params_}")

We run the comparison again, now with the optimized parameters:

params = {
    "gini": {"max_depth": 3, "min_samples_leaf": 1},
    "entropy": {"max_depth": 3, "min_samples_leaf": 1},
    "hellinger": {"max_depth": 3, "min_samples_leaf": 1},
}

for criterion in criterions:
    clf = DecisionTreeClassifier(criterion=criterion, random_state=33, **params[criterion])
    cv_res = cross_validate(clf, X, y, scoring=["neg_log_loss", "neg_brier_score", "roc_auc"], cv=5, n_jobs=-1)
    print(f"{criterion=} params={params[criterion]}")
    print(f"  Average AUC = {np.mean(cv_res['test_roc_auc'])}")
    print(f"  Average log_loss = {-np.mean(cv_res['test_neg_log_loss'])}")

Given my above analysis (someone had better check it!), my temporary answer is yes. Therefore, my temporary vote is -1.
Thanks @lorentzenchr for the experiment. Since the start of the discussion long ago, I have gained new intuitions that I did not necessarily have at the time. Overall, I am convinced by the statistical arguments of @lorentzenchr regarding the importance of getting calibrated models and thus the use of proper scoring rules. I am more and more convinced that there is actually no imbalanced classification problem. With an imbalanced problem, it seems that the issue boils down to getting the "expected" hard predictions. scikit-learn does a bad job with a cut-off point fixed at 0.5 (when considering probabilities). Thus, I think that the missing piece is having access to a meta-estimator that can tweak this cut-off point for a given application (for a specific utility function). This is the purpose of #26120. I see that @richardbatesMcK added a link to the
What I would still be interested in regarding the Hellinger criterion (mainly due to my limited knowledge of statistics) is to gain intuition on its implications for the recursive partitioning within the trees. Basically, do we get the same results using the current tree + cut-off optimization?
What I observed is that the Hellinger distance prevents trees from splitting any further very early. The details section in #16478 (comment) is really instructive in that regard, as it prints the tree depth without setting restrictions: gini/entropy end up around 40 splits deep while Hellinger stays around 2. That is an extreme difference!
I did not follow, but did the bug that I showed in #16478 (comment) get solved? Because the tree might well stop splitting even before it actually should (at least with the default tree parameters).
This is a strange thing for the maintainer of imblearn to say. Are you implying it is not a useful package? I have used imblearn.ensemble.BalancedBaggingClassifier a fair bit and have seen it do better than sklearn.ensemble.RandomForestClassifier. I would like to see a rigorous comparison between the two as you suggest. However, I have also had good success with sklearn.ensemble.GradientBoostingClassifier versus both. Sometimes, using sample_weight in fit() helps. I have not been able to find an optimal method which consistently beats the other options, so my strategy has been to try several methods (see the sketch below). The plan was to add Hellinger distance splitting as another option to try. A benchmarking study comparing the options would be very interesting. The major components to vary would be the amount of imbalance, the data volume, and the noise. It is also important to note that the evaluation metric is not always the same in real-world problems. Sometimes you want calibrated probabilities, sometimes you want to select the top most likely, sometimes you want a high ROC AUC, and sometimes minimizing false positives or false negatives is important. I would expect that it is unlikely that a single method is best in all data scenarios for all evaluation metrics.
I had a similar thought. Are we sure that the bug has been solved? Also, if we are going to hyperparameter tune, that should be applied to all methods equally to make it fair.
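As a loose illustration of the "try several methods" strategy mentioned above, here is a sketch on synthetic data. It assumes imbalanced-learn is installed, and the dataset, estimators, and settings are illustrative rather than part of this PR.

from imblearn.ensemble import BalancedBaggingClassifier
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.utils.class_weight import compute_sample_weight

# A synthetic imbalanced binary problem (about 1% positives).
X, y = make_classification(n_samples=20000, n_features=20, weights=[0.99], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

candidates = {
    "plain random forest": RandomForestClassifier(random_state=0),
    "balanced bagging (imblearn)": BalancedBaggingClassifier(random_state=0),
}
for label, clf in candidates.items():
    clf.fit(X_tr, y_tr)
    print(label, roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1]))

# Passing class-balanced sample weights to fit() is a third option for
# estimators that support it.
rf = RandomForestClassifier(random_state=0)
rf.fit(X_tr, y_tr, sample_weight=compute_sample_weight("balanced", y_tr))
print("weighted random forest", roc_auc_score(y_te, rf.predict_proba(X_te)[:, 1]))
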
I strive to solve problems the best way I can. If it means that
This is exactly what I am currently interested in investigating. I think that you can get the best of both worlds. You can start by optimizing a model (also with hyperparameter tuning) using a proper scoring rule (e.g. log-loss) to get a properly calibrated estimator and then use something like the
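For instance, here is a minimal sketch of that two-step recipe on synthetic data. The hand-rolled threshold search at the end only stands in for whatever cut-off-tuning meta-estimator #26120 ends up providing, and all names and numbers are illustrative.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = make_classification(n_samples=5000, weights=[0.99], random_state=0)
X_train, X_valid, y_train, y_valid = train_test_split(X, y, stratify=y, random_state=0)

# Step 1: tune hyperparameters against log-loss, a proper scoring rule,
# so that the predicted probabilities stay well calibrated.
model = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.01, 0.1, 1.0, 10.0]},
    scoring="neg_log_loss",
).fit(X_train, y_train)

# Step 2: tune the cut-off on held-out probabilities for the metric that
# matters for the application (F1 here, purely as an example).
proba = model.predict_proba(X_valid)[:, 1]
thresholds = np.linspace(0.01, 0.99, 99)
scores = [f1_score(y_valid, (proba >= t).astype(int)) for t in thresholds]
best_threshold = thresholds[int(np.argmax(scores))]
print(f"best threshold for F1: {best_threshold:.2f}")
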
Agreed. I have never found things like SMOTE useful. But I have used BalancedBaggingClassifier successfully.
In many cases (e.g. lead gen) you really only care about the ranking of the predictions when sorted by probability, not the value of the probability; ROC AUC is a good proxy for this. This means altering the cut-off point gets you nothing. What you want is to isolate the rare positive cases well, somewhat like anomaly detection, so what I want is the method which does that optimally. As I said above, methods which do not attempt to handle the imbalance have underperformed in my experience. I am very interested in the deep study looking into all this, but I doubt anything comprehensive and definitive will come before a decision needs to be made about this ticket. What is the bar which needs to be passed for this to get into the next release? Until this week I thought the only blocker was getting the code written.
There is also the spherical scoring rule; it is proper and somewhat similar to the Matusita/Hellinger distance. It is also much easier to implement (basically changing a couple of lines of the Gini criterion).
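One possible reading of that suggestion, offered here only as a hedged sketch: the node impurity induced by the (proper) spherical score is 1 - ||p||_2, versus Gini's 1 - ||p||_2^2, which is why the change relative to the existing Gini code would be small. Plain Python for illustration, not the Cython criterion.

import numpy as np

def gini_impurity(p):
    # Gini: 1 - sum_k p_k ** 2
    p = np.asarray(p, dtype=float)
    return 1.0 - np.sum(p ** 2)

def spherical_impurity(p):
    # Impurity derived from the spherical scoring rule: the expected score of a
    # node with class distribution p is ||p||_2, so use 1 - ||p||_2.
    p = np.asarray(p, dtype=float)
    return 1.0 - np.sqrt(np.sum(p ** 2))

# Both vanish at pure nodes and peak at the uniform distribution.
for p in ([0.5, 0.5], [0.9, 0.1], [0.99, 0.01]):
    print(p, round(gini_impurity(p), 4), round(spherical_impurity(p), 4))
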
It is quite unlikely that this feature will be included in the upcoming 1.3 release. Therefore, I am closing this PR. But please continue the discussion with new insights.
@glemaitre @EvgeniDubov Is it technically possible to get this into imbalanced-learn? As I recall, changing the split criterion was something which needed to be done in sklearn, but the plan was to make it possible to have user-defined splits. Was that the implementation which was done in the end? Is it technically possible to have user-defined splits as an option and then put the specific implementation of the Hellinger split in imbalanced-learn? Surely the option to have user-defined splits is something we would want in sklearn. It opens the door to the testing we described above.
@lorentzenchr That seems to contradict what is stated here: #10251. This was actually proposed as the original solution if you look at this comment: #9947 (comment). What am I missing?
@KeithEdmondsMcK You're right. It's not a build-time restriction.
If we are going to get blocked on direct implementation in sklearn, then I propose we follow this path. I have two questions
Adding the criterion means that we need to move from a pure Python package to an infrastructure where we need to build a wheel for all platforms in the world :)
@EvgeniDubov Do you still plan to debug this? It would be useful to have a branch with the complete feature to use, even if it is a fork. After a while the needed empirical evidence will accumulate.
@KeithEdmondsMcK yep, I don't want to give up on hellinger :) I am planning to do the following
Does the second bullet make sense?
Reference Issue
[scikit-learn] Feature Request: Hellinger split criterion for classification trees #9947
[scikit-learn-contrib] [WIP] ENH: Hellinger distance tree split criterion for imbalanced data classification #437
What does this implement/fix? Explain your changes.
Hellinger distance as a tree split criterion: a Cython implementation compatible with scikit-learn tree-based classification models.
TODO