ENH Hellinger distance split criterion for classification trees #16478
Conversation
@EvgeniDubov Thanks for porting it to scikit-learn. I think that you should merge master into your branch as well (regarding some CI configurations that are not used anymore). To give a bit of context on this PR, we discussed IRL with @ogrisel the integration of this feature into the trees. It would most probably only be supported for the binary case. The thing that we will need is to check that it helps in practice. I think one way would be to build trees using this criterion in the example that we are building in the
@glemaitre thanks for the fast response, I'm on it.
Can anyone add the
Hi everyone, I was just curious when you think this option will be finalised and added?
Are there any updates on when the Hellinger split criterion will be merged?
Thanks @EvgeniDubov for adding this feature!
Thank you for this contribution, @EvgeniDubov.
This is a notable feature.
Some comments:

- The test_importances parametrisation must be extended to include 'hellinger'.
- The example and the implementation of feature importance can come as a follow-up.

Let us know if you need more feedback and if you have time to pursue this work.
@EvgeniDubov: merging
Thanks for the PR @EvgeniDubov!
I left a couple of comments regarding the documentation.
I think it would be nice to add a reference to this new distance for unbalanced datasets in doc/modules/tree.rst.
For example, we can mention the Hellinger distance where it says:
Decision tree learners create biased trees if some classes dominate. It is therefore recommended to balance the dataset prior to fitting with the decision tree.
Likewise here:
Balance your dataset before training to prevent the tree from being biased toward the classes that are dominant. Class balancing can be done by sampling an equal number of samples from each class, or preferably by normalizing the sum of the sample weights (sample_weight) for each class to the same value. Also note that weight-based pre-pruning criteria, such as min_weight_fraction_leaf, will then be less biased toward dominant classes than criteria that are not aware of the sample weights, like min_samples_leaf.
Also, we should mention in doc/modules/tree.rst that the Hellinger distance is only implemented for the binary case.
def check_imbalanced_criterion(name, criterion):
    ForestClassifier = FOREST_CLASSIFIERS[name]

    clf = ForestClassifier(n_estimators=10, criterion=criterion, random_state=1)
    clf.fit(X_large_imbl, y_large_imbl)

    # score is a mean of minority class predict_proba
    score = clf.predict_proba(X_large_imbl)[:, 1].mean()

    assert (
        score > imbl_minority_class_ratio
    ), "Failed with imbalanced criterion %s, score = %f, minority class ratio = %f" % (
        criterion,
        score,
        imbl_minority_class_ratio,
    )


@pytest.mark.parametrize("name", FOREST_CLASSIFIERS)
@pytest.mark.parametrize("criterion", ["hellinger"])
def test_imbalanced_criterions(name, criterion):
    check_imbalanced_criterion(name, criterion)

@jjerphan, @glemaitre Do you know why we use this test pattern (one function with the logic and another parametrized that calls it)?
sklearn/ensemble/tests/test_forest.py has some tests like this.
I do not know and I would merge them together if it's used only once.
Edit: I guess it makes sense if the check is used in various tests, which is not the case here.
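For illustration, the merged form could look roughly like the sketch below, with the helper inlined into the parametrized test. The module-level names (FOREST_CLASSIFIERS, X_large_imbl, y_large_imbl, imbl_minority_class_ratio) are the fixtures from the diff above and are assumed to already exist in test_forest.py, so this only runs in that context.

import pytest

@pytest.mark.parametrize("name", FOREST_CLASSIFIERS)
@pytest.mark.parametrize("criterion", ["hellinger"])
def test_imbalanced_criterions(name, criterion):
    # Same logic as check_imbalanced_criterion above, inlined since it is used only once.
    ForestClassifier = FOREST_CLASSIFIERS[name]
    clf = ForestClassifier(n_estimators=10, criterion=criterion, random_state=1)
    clf.fit(X_large_imbl, y_large_imbl)

    # The mean predicted probability of the minority class should exceed its prior.
    score = clf.predict_proba(X_large_imbl)[:, 1].mean()
    assert score > imbl_minority_class_ratio, (
        f"Failed with imbalanced criterion {criterion}, score = {score}, "
        f"minority class ratio = {imbl_minority_class_ratio}"
    )
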
@glemaitre the Hellinger score is computed per split (it involves three populations: the parent node and both children) rather than per single node population as with gini and entropy, so adding it to the existing tree split design was a bit tricky.
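For illustration, here is a rough sketch of the binary Hellinger split value as formulated by Cieslak & Chawla; the standalone function and the example numbers are purely illustrative and not scikit-learn API. The criterion scores how differently the parent's positives and negatives are routed to the two children, so it is not a weighted sum of per-child impurities the way gini/entropy gains are.

import numpy as np

def hellinger_split_value(n_pos_left, n_neg_left, n_pos_parent, n_neg_parent):
    # Fractions of the parent's positive/negative samples routed to the left child.
    tpr_left = n_pos_left / n_pos_parent
    fpr_left = n_neg_left / n_neg_parent
    # The remainder goes to the right child.
    tpr_right = 1.0 - tpr_left
    fpr_right = 1.0 - fpr_left
    # Hellinger distance between the two class-conditional split distributions.
    return np.sqrt(
        (np.sqrt(tpr_left) - np.sqrt(fpr_left)) ** 2
        + (np.sqrt(tpr_right) - np.sqrt(fpr_right)) ** 2
    )

# A split that isolates minority samples scores high regardless of the class ratio.
print(hellinger_split_value(n_pos_left=9, n_neg_left=10, n_pos_parent=10, n_neg_parent=1000))
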
@EvgeniDubov Thanks for this PR.
@lorentzenchr Do you have any evidence that this would not be helpful for binary classification of imbalanced data? If you look through the history of this ticket you will see that this was originally proposed as an addition to imblearn about 6 years ago. However, imblearn was dependent on sklearn in a way that meant this needed to be added to sklearn. From what I can see there is a lot of interest in this functionality and there is no evidence that it will not perform as expected.
I'm afraid the logic is reversed. One has to prove the usefulness of an additional feature. IMHO, this was not much discussed in #9947.
It takes a lot for me to sacrifice calibration, and I have not seen or do not understand the great advantage of the Hellinger distance as a split criterion. I see that you invested a lot in this feature and that I come in at the last minute. Please bear with me as someone who never understood what the problem of "the imbalanced class problem" is.
@lorentzenchr I did not mean to claim that it should be added without evidence for it merely because there is no evidence against it. #9947 cites several papers that give evidence of its theoretical viability. @EvgeniDubov confirmed its usefulness here: https://github.com/EvgeniDubov/hellinger-distance-criterion. I used it for an insurance client and got better results; I would have deployed it in production if it had been in a major library like sklearn. As a DS consultant, one of the most common problems is imbalanced binary classification. I can go into the details of why it is needed, but the imbalance is an additional issue on top of the binary classification problem itself. In any case, I would have used it several times over the years since my original request. So my question was: given what I would consider ample evidence that it would be a useful addition to sklearn, is there evidence which pushes back against that? If there is only evidence for it and it is a smallish addition, then I do not see why it would not be wanted.
@lorentzenchr https://statistics.berkeley.edu/sites/default/files/tech-reports/666.pdf
https://machinelearningmastery.com/cost-sensitive-decision-trees-for-imbalanced-classification/
From your example, I get
It depends quite a bit on the random seeds. But the point is that the Hellinger criterion does not seem to have a positive effect on the AUC.

# Import the necessary modules and libraries
import numpy as np
from sklearn import datasets
from sklearn.metrics import roc_auc_score, brier_score_loss, log_loss, confusion_matrix
from sklearn.tree import DecisionTreeClassifier

# Create imbalanced dataset
minority_class_ratio = 0.001
n_classes = 2
X, y = datasets.make_classification(
    n_samples=1000 + 1000,
    n_features=2,
    n_informative=2,
    n_redundant=0,
    n_repeated=0,
    n_classes=n_classes,
    n_clusters_per_class=1,
    weights=[1 - minority_class_ratio],
    shuffle=False,
    random_state=0,
)
X_train, y_train = X[:1000], y[:1000]
X_test, y_test = X[1000:], y[1000:]

# Criteria to compare
criterions = ["gini", "entropy", "hellinger"]

for criterion in criterions:
    clf = DecisionTreeClassifier(criterion=criterion, random_state=33)
    clf.fit(X_train, y_train)
    print(f"{criterion=}")
    print(f"training set:")
    y_prob = clf.predict_proba(X_train)
    print(f"  log_loss={log_loss(y_train, y_prob)} auc={roc_auc_score(y_train, y_prob[:, 1])}")
    print(f"test set:")
    y_prob = clf.predict_proba(X_test)
    print(f"  log_loss={log_loss(y_test, y_prob)} auc={roc_auc_score(y_test, y_prob[:, 1])}")
    print("")

I also made the effort to analyze one of the datasets from the cited papers. I decided on the binary version of the covertype dataset, https://www.openml.org/search?type=data&status=active&id=293. The big problem with the default params of
Summary: I do not see the advantage of using the Hellinger distance as a split criterion. Using better hyperparams (

import numpy as np
from sklearn.datasets import fetch_openml
from sklearn.metrics import roc_auc_score, log_loss
from sklearn.model_selection import cross_validate, GridSearchCV
from sklearn.tree import DecisionTreeClassifier
# covertype.binary, see https://www.openml.org/search?type=data&status=active&id=293
X, y = fetch_openml(data_id=293, as_frame=False, parser="auto", return_X_y=True)
y = y.astype(int) # y is of string type / object
criterions = ["gini", "entropy", "hellinger"]
def mean_depth(trees):
    return np.mean([x.get_depth() for x in trees])

for criterion in criterions:
    clf = DecisionTreeClassifier(criterion=criterion, random_state=33)
    cv_res = cross_validate(
        clf, X, y, scoring=["neg_log_loss", "roc_auc"], cv=5, n_jobs=-1, return_estimator=True
    )
    print(f"{criterion=}")
    print(f"  Average AUC = {np.mean(cv_res['test_roc_auc'])}")
    print(f"  Average log_loss = {-np.mean(cv_res['test_neg_log_loss'])}")
    print(f"  Average tree depth = {mean_depth(cv_res['estimator'])}")

The average tree depth shows that gini and entropy produce much too deep trees and are very likely overfitting a lot. To come up with a fairer comparison, we do a grid search.

for criterion in criterions:
    clf = DecisionTreeClassifier(criterion=criterion, random_state=33)
    grid = GridSearchCV(
        clf,
        scoring="roc_auc",
        param_grid={
            "min_samples_leaf": [1, 2, 4, 6, 8, 10],
            "max_depth": [2, 3, 4, 5, 6],
        },
        cv=3,
        n_jobs=-1,
    ).fit(X, y)
    print(f"{criterion=}")
    print(f"  best AUC = {grid.best_score_}")
    print(f"  best params = {grid.best_params_}")

We run the comparison again, now with the optimized parameters:

params = {
    "gini": {"max_depth": 3, "min_samples_leaf": 1},
    "entropy": {"max_depth": 3, "min_samples_leaf": 1},
    "hellinger": {"max_depth": 3, "min_samples_leaf": 1},
}

for criterion in criterions:
    clf = DecisionTreeClassifier(criterion=criterion, random_state=33, **params[criterion])
    cv_res = cross_validate(clf, X, y, scoring=["neg_log_loss", "neg_brier_score", "roc_auc"], cv=5, n_jobs=-1)
    print(f"{criterion=} params={params[criterion]}")
    print(f"  Average AUC = {np.mean(cv_res['test_roc_auc'])}")
    print(f"  Average log_loss = {-np.mean(cv_res['test_neg_log_loss'])}")

Given my above analysis (someone had better check it!), my temporary answer is yes. Therefore, my temporary vote is -1.
Thanks @lorentzenchr for the experiment. Since the start of the discussion long ago, I have gained new intuitions that I did not necessarily have at the time. Overall, I am convinced by the statistical arguments of @lorentzenchr regarding the importance of getting calibrated models and thus the use of proper scoring rules. I am more and more convinced that there is actually no imbalanced classification problem. With an imbalanced problem, it seems that the issue boils down to getting the "expected" hard predictions. scikit-learn does a bad job with a cut-off point fixed at 0.5 (when considering probabilities). Thus, I think that the missing piece is having access to a meta-estimator that can tweak this cut-off point for a given application (for a specific utility function). This is the purpose of #26120. I see that @richardbatesMcK added a link to the
What I would still be interested in regarding the Hellinger criterion (mainly due to my limited knowledge of statistics) is to gain intuition on its implications for the recursive partitioning within the trees. Basically, do we get the same results using the current tree + cut-off optimization?
What I observed is that the Hellinger distance prevents trees from splitting any further very early. The details section in #16478 (comment) is really instructive in that regard, as it prints the tree depth without setting restrictions: gini/entropy end up around 40 splits deep while Hellinger stays around 2. That is an extreme difference!
I did not follow, but did the bug that I showed in #16478 (comment) get solved? Because the tree might well stop splitting even before it actually should (at least with the default tree parameters).
This is a strange thing for the maintainer of imblearn to say. Are you implying it is not a useful package? I have used imblearn.ensemble.BalancedBaggingClassifier a fair bit and have seen it do better than sklearn.ensemble.RandomForestClassifier. I would like to see a rigorous comparison between the two as you suggest. However, I have also had good success with sklearn.ensemble.GradientBoostingClassifier versus both. Sometimes, using sample_weight in fit() helps. I have not been able to find an optimal method which consistently beats the other options, so my strategy has been to try several methods (see the sketch below). The plan was to add Hellinger distance splitting as another option to try. A benchmarking study comparing the options would be very interesting. The major components to vary would be the amount of imbalance, the data volume, and the noise. It is also important to note that the evaluation metric is not always the same in real-world problems. Sometimes you want calibrated probabilities, sometimes you want to select the top most likely, sometimes you want a high ROC AUC, and sometimes minimizing false positives or false negatives is important. I would expect that it is unlikely that a single method is best in all data scenarios for all evaluation metrics.
I had a similar thought. Are we sure that the bug has been solved? Also, if we are going to hyperparameter tune, that should be applied to all methods equally to make it fair.
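As a loose illustration of the "try several methods" strategy mentioned above, here is a sketch on synthetic data. It assumes imbalanced-learn is installed, and the dataset, estimators, and settings are illustrative rather than part of this PR.

from imblearn.ensemble import BalancedBaggingClassifier
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.utils.class_weight import compute_sample_weight

# A synthetic imbalanced binary problem (about 1% positives).
X, y = make_classification(n_samples=20000, n_features=20, weights=[0.99], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

candidates = {
    "plain random forest": RandomForestClassifier(random_state=0),
    "balanced bagging (imblearn)": BalancedBaggingClassifier(random_state=0),
}
for label, clf in candidates.items():
    clf.fit(X_tr, y_tr)
    print(label, roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1]))

# Passing class-balanced sample weights to fit() is a third option for
# estimators that support it.
rf = RandomForestClassifier(random_state=0)
rf.fit(X_tr, y_tr, sample_weight=compute_sample_weight("balanced", y_tr))
print("weighted random forest", roc_auc_score(y_te, rf.predict_proba(X_te)[:, 1]))
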
I strive to solve problems the best way I can. If it means that
This is exactly what I am currently interested in investigating. I think that you can get the best of both worlds. You can start by optimizing a model (also with hyperparameter tuning) using a proper scoring rule (e.g. log-loss) to get a properly calibrated estimator and then use something like the
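For instance, here is a minimal sketch of that two-step recipe on synthetic data. The hand-rolled threshold search at the end only stands in for whatever cut-off-tuning meta-estimator #26120 ends up providing, and all names and numbers are illustrative.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = make_classification(n_samples=5000, weights=[0.99], random_state=0)
X_train, X_valid, y_train, y_valid = train_test_split(X, y, stratify=y, random_state=0)

# Step 1: tune hyperparameters against log-loss, a proper scoring rule,
# so that the predicted probabilities stay well calibrated.
model = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.01, 0.1, 1.0, 10.0]},
    scoring="neg_log_loss",
).fit(X_train, y_train)

# Step 2: tune the cut-off on held-out probabilities for the metric that
# matters for the application (F1 here, purely as an example).
proba = model.predict_proba(X_valid)[:, 1]
thresholds = np.linspace(0.01, 0.99, 99)
scores = [f1_score(y_valid, (proba >= t).astype(int)) for t in thresholds]
best_threshold = thresholds[int(np.argmax(scores))]
print(f"best threshold for F1: {best_threshold:.2f}")
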
Agreed. I have never found things like SMOTE useful. But I have used BalancedBaggingClassifier successfully.
In many cases (e.g. lead gen) you really only care about the ranking of the predictions when sorted by probability, not the value of the probability; ROC AUC is a good proxy for this. This means altering the cut-off point gets you nothing. What you want is to isolate the rare positive cases well, somewhat like anomaly detection, so what I want is the method which does that optimally. As I said above, methods which do not attempt to handle the imbalance have underperformed in my experience. I am very interested in the deep study looking into all this, but I doubt anything comprehensive and definitive will come before a decision needs to be made about this ticket. What is the bar which needs to be passed for this to get into the next release? Until this week I thought the only blocker was getting the code written.
There is also the spherical scoring rule; it is proper and somewhat similar to the Matusita/Hellinger distance. It is also much easier to implement (basically changing a couple of lines of the Gini criterion).
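One possible reading of that suggestion, offered here only as a hedged sketch: the node impurity induced by the (proper) spherical score is 1 - ||p||_2, versus Gini's 1 - ||p||_2^2, which is why the change relative to the existing Gini code would be small. Plain Python for illustration, not the Cython criterion.

import numpy as np

def gini_impurity(p):
    # Gini: 1 - sum_k p_k ** 2
    p = np.asarray(p, dtype=float)
    return 1.0 - np.sum(p ** 2)

def spherical_impurity(p):
    # Impurity derived from the spherical scoring rule: the expected score of a
    # node with class distribution p is ||p||_2, so use 1 - ||p||_2.
    p = np.asarray(p, dtype=float)
    return 1.0 - np.sqrt(np.sum(p ** 2))

# Both vanish at pure nodes and peak at the uniform distribution.
for p in ([0.5, 0.5], [0.9, 0.1], [0.99, 0.01]):
    print(p, round(gini_impurity(p), 4), round(spherical_impurity(p), 4))
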
It is quite unlikely that this feature will be included in the upcoming 1.3 release. Therefore, I am closing this PR. But please continue the discussion with new insights.
@glemaitre @EvgeniDubov Is it technically possible to get this into imbalanced-learn? As I recall, changing the split criterion was something which needed to be done in sklearn, but the plan was to make it possible to have user-defined splits. Was that the implementation which was done in the end? Is it technically possible to have user-defined splits as an option and then put the specific implementation of the Hellinger split in imbalanced-learn? Surely the option to have user-defined splits is something we would want in sklearn. It opens the door to the testing we described above.
@lorentzenchr That seems to contradict what is stated here: #10251. This was actually proposed as the original solution if you look at this comment: #9947 (comment). What am I missing?
@KeithEdmondsMcK You're right. It's not a build-time restriction.
If we are going to get blocked on direct implementation in sklearn, then I propose we follow this path. I have two questions
Adding the criterion means that we need to move from a pure Python package to an infrastructure where we need to build a wheel for all platforms in the world :)
@EvgeniDubov Do you still plan to debug this? It would be useful to have a branch with the complete feature to use, even if it is a fork. After a while the needed empirical evidence will accumulate.
@KeithEdmondsMcK yep, I don't want to give up on hellinger :) I am planning to do the following
Does the second bullet make sense?
Reference Issue
[scikit-learn] Feature Request: Hellinger split criterion for classification trees #9947
[scikit-learn-contrib] [WIP] ENH: Hellinger distance tree split criterion for imbalanced data classification #437
What does this implement/fix? Explain your changes.
Hellinger distance as a tree split criterion: a Cython implementation compatible with scikit-learn tree-based classification models.
TODO