ENH Hellinger distance split criterion for classification trees #16478

Closed

Conversation

EvgeniDubov

@EvgeniDubov EvgeniDubov commented Feb 19, 2020

Reference Issue
[scikit-learn] Feature Request: Hellinger split criterion for classificaiton trees #9947
[scikit-learn-contrib] [WIP] ENH: Hellinger distance tree split criterion for imbalanced data classification #437

What does this implement/fix? Explain your changes.
Hellinger distance as a tree split criterion: a Cython implementation compatible with scikit-learn tree-based classification models.

TODO

  • release notes
  • documentation
  • examples
  • implement feature importance
  • add documentation in cython code

@glemaitre
Member

@EvgeniDubov Thanks for porting it to scikit-learn. I think that you should also merge master into your branch (some CI jobs are not used anymore).

To give a bit of context on the PR, we discussed IRL with @ogrisel the integration of this feature into the trees. It would most probably only be supported for the binary case. What we still need is to check that it helps in practice. I think one way would be to build trees using this criterion in the example that we are building for the BalancedRandomForest: #13227

@EvgeniDubov
Author

@glemaitre thanks for the fast response, I'm on it

@chkoar
Contributor

chkoar commented Jul 29, 2020

Can anyone add the Waiting for Reviewer label to this PR?

Base automatically changed from master to main January 22, 2021 10:52
@ndrmahmoudi

Hi everyone, I was just curious when you think this option will be finalised and added?

@giladwa1

giladwa1 commented Jan 9, 2022

Are there any updates on when the Hellinger split criterion will be merged?
IMO this is a very useful feature for cases of imbalanced labels, and it would be great to have it as part of scikit-learn.

@kahanad

kahanad commented Jan 9, 2022

Thanks @EvgeniDubov for adding this feature!
Any idea when it'll be merged?
Thanks!

Member

@jjerphan jjerphan left a comment

Thank you for this contribution, @EvgeniDubov.
This is a notable feature.

Some comments:

  • the test_importances parametrisation must be extended to include 'hellinger'.
  • the example and the implementation of feature importance can come as a follow-up.

Let us know if you need more feedback and whether you have time to pursue this work.

@jjerphan
Member

@EvgeniDubov: merging main again will solve the problem with the CI.

Member

@jmloyola jmloyola left a comment

Thanks for the PR @EvgeniDubov!

I left a couple of comments regarding the documentation.

I think it would be nice to add a reference to this new distance for unbalanced datasets in doc/modules/tree.rst.
For example, we can mention the Hellinger distance where it says:

Decision tree learners create biased trees if some classes dominate. It is therefore recommended to balance the dataset prior to fitting with the decision tree.

Likewise here:

Balance your dataset before training to prevent the tree from being biased toward the classes that are dominant. Class balancing can be done by sampling an equal number of samples from each class, or preferably by normalizing the sum of the sample weights (sample_weight) for each class to the same value. Also note that weight-based pre-pruning criteria, such as min_weight_fraction_leaf, will then be less biased toward dominant classes than criteria that are not aware of the sample weights, like min_samples_leaf.

Also, we should mention in doc/modules/tree.rst that the Hellinger distance was only implemented for the binary case.
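
As a small illustration of the weight-based balancing mentioned in the quoted documentation (a generic sketch, not part of this PR; X_train and y_train are placeholder names):

from sklearn.tree import DecisionTreeClassifier
from sklearn.utils.class_weight import compute_sample_weight

# Give every class the same total sample weight, then fit the tree with it.
sample_weight = compute_sample_weight(class_weight="balanced", y=y_train)
clf = DecisionTreeClassifier(min_weight_fraction_leaf=0.01, random_state=0)
clf.fit(X_train, y_train, sample_weight=sample_weight)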

@EvgeniDubov EvgeniDubov closed this Feb 6, 2022
@EvgeniDubov EvgeniDubov deleted the hellinger_distance_criterion branch February 6, 2022 09:59
@EvgeniDubov EvgeniDubov restored the hellinger_distance_criterion branch February 6, 2022 09:59
@EvgeniDubov EvgeniDubov reopened this Feb 6, 2022
Comment on lines 179 to 201
def check_imbalanced_criterion(name, criterion):
    ForestClassifier = FOREST_CLASSIFIERS[name]

    clf = ForestClassifier(n_estimators=10, criterion=criterion, random_state=1)
    clf.fit(X_large_imbl, y_large_imbl)

    # score is a mean of minority class predict_proba
    score = clf.predict_proba(X_large_imbl)[:, 1].mean()

    assert (
        score > imbl_minority_class_ratio
    ), "Failed with imbalanced criterion %s, score = %f, minority class ratio = %f" % (
        criterion,
        score,
        imbl_minority_class_ratio,
    )


@pytest.mark.parametrize("name", FOREST_CLASSIFIERS)
@pytest.mark.parametrize("criterion", ["hellinger"])
def test_imbalanced_criterions(name, criterion):
    check_imbalanced_criterion(name, criterion)

Member

@jjerphan, @glemaitre Do you know why we use this test pattern (one function with the logic and another parametrized that calls it)?
sklearn/ensemble/tests/test_forest.py has some tests like this.

Member

@jjerphan jjerphan Feb 9, 2022

I do not know and I would merge them together if it's used only once.

Edit: I guess it makes sense if the check is used in various tests, which is not the case here.

@EvgeniDubov EvgeniDubov closed this May 3, 2023
@EvgeniDubov EvgeniDubov reopened this May 3, 2023
@EvgeniDubov
Author

One would need to investigate the cause of the bug shown in #16478 (comment)

@glemaitre The Hellinger score is per split (three populations), not per single population as with gini and entropy, so adding it to the existing tree split design was a bit tricky.
I tried to adapt it by breaking the Hellinger formula into two parts, one for each child node, and making the left half of the formula the left child's "impurity".
This solution seems to work with the current split mechanism, but it doesn't always get to pure leaves.
I'm trying to get a deeper understanding of the _splitter.pyx code to figure out why Hellinger doesn't split down to pure leaves.
Any help with this final point is very welcome.
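
For readers following along, here is a minimal plain-Python sketch of the binary-case Hellinger split criterion (as described in the Hellinger distance decision tree literature referenced in #9947) and of the left/right decomposition described above. This is an illustration only, not the PR's Cython code; the function name and normalization details are assumptions.

import numpy as np

def hellinger_split_value(n_left, n_right):
    """Hellinger distance of a candidate binary split.

    n_left, n_right: class counts [n_class_0, n_class_1] in each child node.
    """
    n_left = np.asarray(n_left, dtype=float)
    n_right = np.asarray(n_right, dtype=float)
    n_parent = n_left + n_right
    # Fraction of each class that is routed to the left / right child.
    p_left = n_left / n_parent
    p_right = n_right / n_parent
    # The two halves of the squared distance; in the adaptation described above,
    # each half plays the role of one child's "impurity" in the existing splitter design.
    left_half = (np.sqrt(p_left[0]) - np.sqrt(p_left[1])) ** 2
    right_half = (np.sqrt(p_right[0]) - np.sqrt(p_right[1])) ** 2
    return np.sqrt(left_half + right_half)

With this formulation, a split that perfectly separates the two classes gives a value of sqrt(2), while a split that routes both classes in the same proportions gives 0.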

@lorentzenchr
Member

lorentzenchr commented May 4, 2023

@EvgeniDubov Thanks for this PR.
From a first quick read of the example and the paper on the Hellinger distance, I'm not convinced we should include it, for several reasons:

  1. While the Hellinger distance is an f-divergence, it does not stem from a proper scoring rule:

    Friedman (1983) and Nau (1985) studied a looser type of relationship between proper scoring rules and distance measures on classes of probability distributions. They restricted attention to metrics (i.e., distance measures that are symmetric and satisfy the triangle inequality) and called a scoring rule $S$ effective with respect to a metric $d$ if
    $S(P_1, Q) \geq S(P_2, Q) \Leftrightarrow d(P_1, Q) \leq d(P_2, Q)$.
    Nau (1985) called a metric co-effective if there is a proper scoring rule that is effective with respect to it. His proposition 1 implies that the $l_1$, $l_\infty$, and Hellinger distances on spaces of absolutely continuous probability measures are not co-effective.

    Gneiting & Raftery, "Strictly Proper Scoring Rules, Prediction, and Estimation" (free PDF)
    This means it will not recover the true probabilities in the end. (Maybe I'm missing a step for discrete distributions.)

  2. I do not understand why the figure in the new example would show that the Hellinger criterion is superior in this setting. I see that Gini and entropy better identify the one outlier at the top, which might also be problematic with the default min_samples_leaf=1. For the cluster of the minority class in the middle, given only the two features, no model will ever predict those with high probability.
    I would like to see the numbers np.sum(model.predict_proba(X_train), axis=0) and the frequencies of the classes in y_train (a sketch of this diagnostic follows below).
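
A possible sketch of the requested diagnostic (a generic helper, not part of the PR; for a well-calibrated model the two vectors should be close):

import numpy as np

def calibration_summary(model, X, y):
    """Compare per-class sums of predict_proba with observed class counts."""
    prob_sums = np.sum(model.predict_proba(X), axis=0)
    counts = np.bincount(y)
    print("sum of predicted probabilities per class:", prob_sums)
    print("observed class counts:", counts)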

@KeithEdmondsMcK

@lorentzenchr
I think you are missing the point of this functionality. The utility is to increase binary classification metrics like ROC AUC, not to give a calibrated probability. This is specifically for use in the case of imbalanced binary classification. Resampling and reweighting techniques to handle imbalance typically give highly uncalibrated probabilities as well. Tools to handle calibration have been around for a while, so it's a bit of a non-issue: https://scikit-learn.org/stable/modules/calibration.html

Do you have any evidence that this would not be helpful for binary classification of imbalanced data? If you look through the history of this ticket you will see that this was originally proposed as an addition to imblearn about 6 years ago. However, imblearn depended on sklearn in a way that meant this needed to be added to sklearn. From what I can see there is a lot of interest in this functionality, and there is no evidence that it will not perform as expected.

@lorentzenchr
Member

Do you have any evidence that this would not be helpful for binary classification of imbalanced data?

I'm afraid the logic is reversed. One has to prove the usefulness of an additional feature. IMHO, this was not much discussed in #9947.

The utility is to increase binary classification metrics like ROC AUC not to give a calibrated probability.

It takes a lot for me to sacrifice calibration. And I have not seen or do not understand the great advantage of the Hellinger distance as a split criterion.

I see that you have invested a lot in this feature and that I come in at the last minute. Please bear with me as someone who never understood what the problem of "the imbalanced class problem" is.

@KeithEdmondsMcK

KeithEdmondsMcK commented May 4, 2023

@lorentzenchr I did not intend to claim that it should be added merely because there is no evidence against it; there is evidence for it. #9947 lists several papers that give evidence of the theoretical viability. @EvgeniDubov demonstrated the usefulness here: https://github.com/EvgeniDubov/hellinger-distance-criterion. I used it for an insurance client and got better results. I would have deployed it in production if it had been in a major library like scikit-learn. As a DS consultant, one of the most common problems I see is imbalanced binary classification. I can go into the details of why it is needed, but the imbalance is an additional issue on top of the binary classification problem itself. In any case, I would have used it several times over the years since my original request. So my question was: given what I would consider ample evidence that it would be a useful addition to scikit-learn, is there evidence which pushes back against that? If there is only evidence for it and it is a smallish addition, then I do not see why it would not be wanted.

@richardbatesMcK

@lorentzenchr
Decision tree models are known to be impacted by imbalanced data sets.

https://statistics.berkeley.edu/sites/default/files/tech-reports/666.pdf

  • However, the most commonly used classification algorithms do not work well for such problems because they aim to minimize the overall error rate, rather than paying special attention to the positive class.

https://machinelearningmastery.com/cost-sensitive-decision-trees-for-imbalanced-classification/

  • The decision tree algorithm is effective for balanced classification, although it does not perform well on imbalanced datasets.
  • The split points of the tree are chosen to best separate examples into two groups with minimum mixing. When both groups are dominated by examples from one class, the criterion used to select a split point will see good separation, when in fact, the examples from the minority class are being ignored.

@lorentzenchr
Member

From your example, I get

  • criterion='gini'
    • training set:
      log_loss=2.220446049250313e-16 auc=1.0
    • test set:
      log_loss=0.648785761004109 auc=0.4949596774193548
  • criterion='entropy'
    • training set:
      log_loss=2.220446049250313e-16 auc=1.0
    • test set:
      log_loss=0.7208730677823433 auc=0.4939516129032258
  • criterion='hellinger'
    • training set:
      log_loss=0.009106686791084227 auc=0.9980696755216472
    • test set:
      log_loss=0.6549246984162754 auc=0.48538306451612906

It depends quite a bit on the random seeds. But the point is that the Hellinger criterion does not seem to have a positive effect on the AUC.

# Import the necessary modules and libraries
import numpy as np
from sklearn import datasets
from sklearn.metrics import roc_auc_score, brier_score_loss, log_loss, confusion_matrix
from sklearn.tree import DecisionTreeClassifier


# Create imbalanced dataset
minority_class_ratio = 0.001
n_classes = 2
X, y = datasets.make_classification(
    n_samples=1000 + 1000,
    n_features=2,
    n_informative=2,
    n_redundant=0,
    n_repeated=0,
    n_classes=n_classes,
    n_clusters_per_class=1,
    weights=[1 - minority_class_ratio],
    shuffle=False,
    random_state=0,
)
X_train, y_train = X[:1000], y[:1000]
X_test, y_test = X[1000:], y[1000:]


# Criteria to compare
criterions = ["gini", "entropy", "hellinger"]

for criterion in criterions:
    clf = DecisionTreeClassifier(criterion=criterion, random_state=33)
    clf.fit(X_train, y_train)
    print(f"{criterion=}")
    print(f"training set:")
    y_prob = clf.predict_proba(X_train)
    print(f"  log_loss={log_loss(y_train, y_prob)} auc={roc_auc_score(y_train, y_prob[:, 1])}")
    print(f"test set:")
    y_prob = clf.predict_proba(X_test)
    print(f"  log_loss={log_loss(y_test, y_prob)} auc={roc_auc_score(y_test, y_prob[:, 1])}")
    print("")

@lorentzenchr
Member

I also made the effort to analyze one of the datasets from the cited papers. I decided on the binary version of the covertype dataset, https://www.openml.org/search?type=data&status=active&id=293.

The big problem with the default params of DecisionTreeClassifier is that it uses min_samples_leaf=1 and max_depth=None (meaning no limit). Therefore, I used a grid search to make the comparison fairer.
Using the best values gives:

criterion='gini' params={'max_depth': 3, 'min_samples_leaf': 1}
  Average AUC = 0.7427456478411264
  Average log_loss = 0.6149432406120839
criterion='entropy' params={'max_depth': 3, 'min_samples_leaf': 1}
  Average AUC = 0.7466694513462981
  Average log_loss = 0.610347261462469
criterion='hellinger' params={'max_depth': 3, 'min_samples_leaf': 1}
  Average AUC = 0.6625673560369815
  Average log_loss = 0.6497047640247929

Summary: I do not see the advantage of using the Hellinger distance as a split criterion. Using better hyperparameters (max_depth and min_samples_leaf) than the defaults has been sufficient in the cases I analyzed.

import numpy as np
from sklearn.datasets import fetch_openml
from sklearn.metrics import roc_auc_score, log_loss
from sklearn.model_selection import cross_validate, GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# covertype.binary, see https://www.openml.org/search?type=data&status=active&id=293
X, y = fetch_openml(data_id=293, as_frame=False, parser="auto", return_X_y=True)
y = y.astype(int)  # y is of string type / object

criterions = ["gini", "entropy", "hellinger"]
def mean_depth(trees):
    return np.mean([x.get_depth() for x in trees])

for criterion in criterions:
    clf = DecisionTreeClassifier(criterion=criterion, random_state=33)
    cv_res = cross_validate(
        clf, X, y, scoring=["neg_log_loss", "roc_auc"], cv=5, n_jobs=-1, return_estimator=True
    )
    print(f"{criterion=}")
    print(f"  Average AUC = {np.mean(cv_res['test_roc_auc'])}")
    print(f"  Average log_loss = {-np.mean(cv_res['test_neg_log_loss'])}")
    print(f"  Average tree depth = {mean_depth(cv_res['estimator'])}")
criterion='gini'
  Average AUC = 0.6261283565436794
  Average log_loss = 13.428064691489425
  Average tree depth = 41.0
criterion='entropy'
  Average AUC = 0.6271224404131799
  Average log_loss = 13.3942540173142
  Average tree depth = 46.0
criterion='hellinger'
  Average AUC = 0.6625673560369815
  Average log_loss = 0.6497047640247929
  Average tree depth = 1.8

The average tree depth shows that gini and entropy grow much too deep trees and are very likely overfitting a lot. To come up with a fairer comparison, we do a grid search.

for criterion in criterions:
    clf = DecisionTreeClassifier(criterion=criterion, random_state=33)
    grid = GridSearchCV(
        clf,
        scoring="roc_auc",
        param_grid={
            "min_samples_leaf": [1, 2, 4, 6, 8, 10],
            "max_depth": [2, 3, 4, 5, 6],
        },
        cv=3,
        n_jobs=-1,
    ).fit(X, y)
    print(f"{criterion=}")
    print(f"  best AUC = {grid.best_score_}")
    print(f"  best params = {grid.best_params_}")
criterion='gini'
  best AUC = 0.7275554647833573
  best params = {'max_depth': 3, 'min_samples_leaf': 1}
criterion='entropy'
  best AUC = 0.7210744088919395
  best params = {'max_depth': 3, 'min_samples_leaf': 1}
criterion='hellinger'
  best AUC = 0.6806826150720867
  best params = {'max_depth': 3, 'min_samples_leaf': 1}

We run the comparison again, now with the optimized parameters:

params = {
    "gini": {"max_depth": 3, "min_samples_leaf": 1},
    "entropy": {"max_depth": 3, "min_samples_leaf": 1},
    "hellinger": {"max_depth": 3, "min_samples_leaf": 1},
}
for criterion in criterions:
    clf = DecisionTreeClassifier(criterion=criterion, random_state=33, **params[criterion])
    cv_res = cross_validate(clf, X, y, scoring=["neg_log_loss", "neg_brier_score", "roc_auc"], cv=5, n_jobs=-1)
    print(f"{criterion=} params={params[criterion]}")
    print(f"  Average AUC = {np.mean(cv_res['test_roc_auc'])}")
    print(f"  Average log_loss = {-np.mean(cv_res['test_neg_log_loss'])}")
criterion='gini' params={'max_depth': 3, 'min_samples_leaf': 1}
  Average AUC = 0.7427456478411264
  Average log_loss = 0.6149432406120839
criterion='entropy' params={'max_depth': 3, 'min_samples_leaf': 1}
  Average AUC = 0.7466694513462981
  Average log_loss = 0.610347261462469
criterion='hellinger' params={'max_depth': 3, 'min_samples_leaf': 1}
  Average AUC = 0.6625673560369815
  Average log_loss = 0.6497047640247929

@lorentzenchr
Member

Do you have any evidence that this would not be helpful for binary classification of imbalanced data?

Given my above analysis (someone had better double-check it!), my temporary answer is yes. Therefore, my temporary vote is

-1

@glemaitre
Member

Thanks @lorentzenchr for the experiment.

Since the start of the discussion long ago, I got new intuitions that I did not necessarily have at the time.

Overall, I am convinced by the statistical arguments of @lorentzenchr regarding the importance of getting calibrated models and thus the use of proper scoring rules. I am more and more convinced that there is actually no imbalanced classification problem.

With an imbalanced problem, it seems that the issue boils down to getting the "expected" hard predictions. scikit-learn does a bad job with a cut-off point fixed at 0.5 (when considering probabilities). Thus, I think that the missing piece is having access to a meta-estimator that can tweak this cut-off point for a given application (for a specific utility function). This is the purpose of #26120. I see that @richardbatesMcK added a link to the BalancedRandomForest: actually, I plan to run an experiment to compare such an approach with a RandomForest + cut-off optimization.

What I would still be interested in regarding the Hellinger criterion (mainly due to my limited knowledge of statistics) is to gain intuitions on the implication regarding the recursive partitioning within the trees. Basically, do we get the same results using the current tree + cut-off optimization?

@lorentzenchr
Member

What I would still be interested in regarding the Hellinger criterion is to gain intuitions on the implication regarding the recursive partitioning within the trees. Basically, do we get the same results using the current tree + cut-off optimization?

What I observed is that the Hellinger distance stops trees from splitting any further very early. The details section in #16478 (comment) is really instructive in that regard, as it prints the tree depth without setting restrictions. There, gini/entropy reach a depth of around 40 while Hellinger is around 2. That is an extreme difference!

@glemaitre
Member

Then gini/entropy are around 40 splits while Hellinger is around 2. That is an extreme difference!

I did not follow closely, but did the bug that I showed in #16478 (comment) get solved? Because the tree might stop splitting even before it actually should (at least with the default tree parameters).

@KeithEdmondsMcK

@glemaitre

I am more and more convinced that there is actually no imbalanced classification problem.

This is a strange thing for the maintainer of imblearn to say. Are you implying it is not a useful package? I have used imblearn.ensemble.BalancedBaggingClassifier a fair bit and have seen it do better than sklearn.ensemble.RandomForestClassifier. I would like to see a rigorous comparison between the two, as you suggest. However, I have also had good success with sklearn.ensemble.GradientBoostingClassifier versus both. Sometimes, using sample_weight in fit() helps. I have not been able to find an optimal method which consistently beats the other options, so my strategy has been to try several methods. The plan was to add Hellinger distance splitting as another option to try.

A benchmarking study comparing the options would be very interesting. The major components to vary would be the amount of imbalance, the data volume, and the noise. It is also important to note that the evaluation metric is not always the same in real-world problems. Sometimes you want calibrated probabilities, sometimes you want to select the top most likely, sometimes you want high ROC AUC, and sometimes minimizing false positives or false negatives is important. I would expect that it is unlikely that a single method is best in all data scenarios for all evaluation metrics.

I did not follow but did the bug that I showed in #16478 (comment) got solved? Because the tree might probably stop splitting even before it actually should (at least with the default tree parameter).

I had a similar thought. Are we sure that the bug has been solved? Also, if we are going to tune hyperparameters, that should be applied to all methods equally to make it fair.

@glemaitre
Member

This is a strange thing for the maintainer of imblearn to say. Are you implying it is an unuseful package?

I strive to solve problems the best way I can. If it means that imbalanced-learn becomes irrelevant because it is not the way to go, I am completely fine with it. I implemented these methods because the scientific literature pointed in this direction. Nowadays, I am not convinced anymore, but I need tangible experiments to show where to go from there.

Sometimes you want calibrated probabilities, sometimes you want to select the top most likely, sometimes you want high ROC AUC, sometimes minimizing false positives or false negatives is important. I would expect that it is unlikely that a single method is best in all data scenarios for all evaluations metrics.

This is exactly what I am currently interested in investigating. I think that you can get the best of both worlds. You can start by optimizing a model (also with hyperparameter tuning) using a proper scoring rule (e.g. log-loss) to get a properly calibrated estimator, and then use something like the CutOffClassifier (i.e. #26120) to get "hard" predictions that are optimal for your specific business metric (could be in $ or just a control point on TPR/FPR).
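
As a rough sketch of this two-step workflow: a threshold-tuning meta-estimator became available in later scikit-learn releases as sklearn.model_selection.TunedThresholdClassifierCV (1.5+), so the snippet below illustrates the idea rather than code that ran at the time of this discussion; the dataset, estimator, and scorer choices are arbitrary.

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import TunedThresholdClassifierCV, train_test_split

# An imbalanced toy problem.
X, y = make_classification(n_samples=5000, weights=[0.95], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

# Step 1: a probabilistic model whose hyperparameters would be tuned with a
# proper scoring rule such as the log-loss.
model = RandomForestClassifier(random_state=0)

# Step 2: tune the decision threshold for the "hard" prediction metric of
# interest instead of using the default cut-off of 0.5.
tuned = TunedThresholdClassifierCV(model, scoring="balanced_accuracy", cv=5)
tuned.fit(X_train, y_train)
print("tuned decision threshold:", tuned.best_threshold_)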

@KeithEdmondsMcK

If it means that imbalanced-learn becomes irrelevant because it is not the way to go, I am completely fine with it.

Agreed. I have never found things like SMOTE useful. But I have used BalancedBaggingClassifier successfully.

I think that you can get the best of the two worlds. You can start by optimizing a model (also with hyperparameter tuning) using a proper scoring rule (e.g. log-loss) to get a properly calibrated estimator and then use something as the CutOffClassifier (i.e. #26120) to get "hard" predictions that are optimum to your specific business metric (could be in $ or just a control point TPR/FPR).

In many cases (e.g. lead generation) you really only care about the ranking of the predictions when sorted by probability, not the value of the probability. ROC AUC is a good proxy for this. This means altering the cut-off point gets you nothing. What you want is to isolate the rare positive cases well, somewhat like anomaly detection. So what I want is the method which does that optimally. As I said above, methods which do not attempt to handle the imbalance have underperformed in my experience.

I am very interested in a deep study looking into all this, but I doubt anything comprehensive and definitive will come before a decision needs to be made about this ticket. What is the bar which needs to be passed for this to get into the next release? Until this week I thought the only blocker was getting the code written.

@glevv
Contributor

glevv commented May 7, 2023

There is also the spherical scoring rule; it is proper and somewhat similar to the Matusita/Hellinger distance. It is also much easier to implement (basically a change of a couple of lines of the Gini criterion).
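
To make the suggestion concrete, here is a hypothetical sketch of how a spherical-score-based node impurity could sit next to Gini; the exact form 1 - ||p||_2 and the function names are assumptions for illustration, not an agreed design.

import numpy as np

def gini_impurity(class_counts):
    """Gini impurity: 1 - sum_k p_k**2."""
    p = np.asarray(class_counts, dtype=float)
    p /= p.sum()
    return 1.0 - np.sum(p**2)

def spherical_impurity(class_counts):
    """Impurity derived from the spherical score: 1 - sqrt(sum_k p_k**2).

    Only the final norm differs from Gini, which is why the change to the
    existing Cython criterion would be small.
    """
    p = np.asarray(class_counts, dtype=float)
    p /= p.sum()
    return 1.0 - np.sqrt(np.sum(p**2))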

@lorentzenchr
Member

What is the bar which needs to be passed for this to get into the next release?

It is quite unlikely that this feature will be included in the upcoming 1.3 release.
As it stands, I'm currently against inclusion (@glemaitre maybe too). My main concerns are summarized in #16478 (comment). Additionally, my experiments in #16478 (comment) (please also note the details section) indicate no improvement.
If there is new evidence (e.g. I made a mistake, the current implementation of this PR has a flaw, another striking example shows the benefit, etc.), I'm happy to reconsider.

Therefore, I am closing this PR. But please continue the discussion with new insights.

@KeithEdmondsMcK

@glemaitre @EvgeniDubov Is it technically possible to get this into imbalanced-learn? As I recall, changing the split criterion was something which needed to be done in scikit-learn, but the plan was to make it possible to have user-defined splits. Was that the implementation which was done in the end? Is it technically possible to have user-defined splits as an option and then put the specific implementation of the Hellinger split in imbalanced-learn? Surely the option to have user-defined splits is something we would want in scikit-learn. It opens the door to the testing we described above.

@lorentzenchr
Member

lorentzenchr commented May 7, 2023

In scikit-learn, splitting and split criteria are implemented in performance-critical Cython code. Therefore, it is not possible to provide user-defined splitting criteria (they must be available at compile time).

@KeithEdmondsMcK

@lorentzenchr That seems to contradict what is stated in #10251.

This was actually proposed as the original solution, if you look at this comment: #9947 (comment)

What am I missing?

@lorentzenchr
Member

@KeithEdmondsMcK You're right. It's not a build-time restriction.

@KeithEdmondsMcK

If we are going to be blocked on a direct implementation in scikit-learn, then I propose we follow this path. I have two questions:

  1. Is anything needed on the sklearn side to make this possible? Extending Criterion #10251 seems to imply that some work is still needed but it is deprioritized.
  2. @glemaitre Are you opposed to the addition of the code for Hellinger Distance to the imbalanced-learn library?

@glemaitre
Member

@glemaitre Are you opposed to the addition of the code for Hellinger Distance to the imbalanced-learn library?

Adding the criterion means that we would need to move from a pure Python package to an infrastructure where we build wheels for all platforms in the world :)
Being the one doing the releases, I don't have the bandwidth to take care of maintaining the wheel-building infrastructure.

@KeithEdmondsMcK

@EvgeniDubov Do you still plan to debug this? It would be useful to have a branch with the complete feature to use even if it is a fork. After a while the needed empirical evidence will accumulate.

@EvgeniDubov
Author

EvgeniDubov commented Jun 20, 2023

@KeithEdmondsMcK Yep, I don't want to give up on Hellinger :) I'm planning to do the following:

  1. Address this comment about the tree stopping splitting too early
  2. Compare hellinger to gini and entropy with a metric other than AUC, because its purpose is different; maybe recall at k, where k is the relative size of the minority class (a rough sketch is below)

Does the second item make sense?
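
As a possible sketch of the metric mentioned in item 2 (the name recall_at_k and the exact definition are assumptions, not an agreed evaluation protocol):

import numpy as np

def recall_at_k(y_true, y_score, k=None):
    """Fraction of all positives found among the k highest-scored samples.

    By default k equals the number of minority-class (positive) samples.
    """
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score)
    n_pos = int(np.sum(y_true == 1))
    if k is None:
        k = n_pos
    top_k = np.argsort(y_score)[::-1][:k]
    return np.sum(y_true[top_k] == 1) / n_pos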
