
[BUG]The sampling method of the BRFClassifier is different from the paper #838

Closed
Chengwen-98 opened this issue May 18, 2021 · 6 comments · Fixed by #1006 or #1010
Comments

@Chengwen-98

Describe the bug

Hi. The following is the sampling method of BRF in the paper "Using Random Forest to Learn Imbalanced Data":

> For each iteration in random forest, draw a bootstrap sample from the minority class.
> Randomly draw the same number of cases, with replacement, from the majority class.

My interpretation is that the minority samples in each sub-training set are selected by bootstrap, so each sub-training set is balanced. These balanced sub-training sets are then given to the trees of a traditional random forest.
But in the code of `_local_parallel_build_trees` in `imblearn/ensemble/_forest.py`, I found that all minority samples in the training set are used in every sub-training set; the minority samples in each tree's sub-training set are identical.
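For reference, the per-tree sampling described in the paper could be sketched like this (the function name and index-based setup are mine for illustration, not imbalanced-learn's API):

```python
import numpy as np

def paper_brf_sample(minority_idx, majority_idx, rng):
    """Per-tree sampling as described in the paper: bootstrap the
    minority class, then draw the same number of majority cases,
    with replacement."""
    n_min = len(minority_idx)
    boot_min = rng.choice(minority_idx, size=n_min, replace=True)
    boot_maj = rng.choice(majority_idx, size=n_min, replace=True)
    return np.concatenate([boot_min, boot_maj])

rng = np.random.default_rng(0)
minority = np.arange(10)        # indices of 10 minority samples
majority = np.arange(10, 100)   # indices of 90 majority samples
sub = paper_brf_sample(minority, majority, rng)
# Each tree's sub-training set is balanced, and both halves are
# bootstrap draws, so different trees see different minority samples.
```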

@chkoar
Member

chkoar commented May 18, 2021

That is correct. I was under the impression that we had fixed this. The reference should be removed and the docstring updated.
There is an open MR in the scikit-learn repo with an actual implementation of the paper you posted. Still, as you can see, the implementation that we offer through imbalanced-learn is comparable to the original algorithm.
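My reading of the comparable behaviour discussed in this thread, sketched with my own names (not imbalanced-learn internals): the majority class is undersampled down to the minority size, every minority sample is kept, and the balanced set is then bootstrapped per tree.

```python
import numpy as np

def undersample_then_bootstrap(minority_idx, majority_idx, rng):
    """Undersample the majority class to the minority size (keeping
    all minority samples), then bootstrap the balanced set."""
    n_min = len(minority_idx)
    under_maj = rng.choice(majority_idx, size=n_min, replace=False)
    balanced = np.concatenate([minority_idx, under_maj])
    return rng.choice(balanced, size=len(balanced), replace=True)

rng = np.random.default_rng(0)
sub = undersample_then_bootstrap(np.arange(10), np.arange(10, 100), rng)
```

The result is still a balanced, bootstrapped sub-training set, which is why the two variants behave comparably in practice.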

@Chengwen-98
Author

> That is correct. I was under the impression that we had fixed this. The reference should be removed and the docstring updated.
> There is an open MR in the scikit-learn repo with an actual implementation of the paper you posted. Still, as you can see, the implementation that we offer through imbalanced-learn is comparable to the original algorithm.

Thanks for your answer. The reference is still on the website: https://pypi.org/project/imbalanced-learn/#id17

@chkoar
Member

chkoar commented May 18, 2021

Basically, the reference is everywhere. We should remove the citation from the docs and docstrings, and instead say that we have implemented a variation of random forests adapted to imbalanced data sets.

@Chengwen-98
Author

> Basically, the reference is everywhere. We should remove the citation from the docs and docstrings, and instead say that we have implemented a variation of random forests adapted to imbalanced data sets.

A sudden announcement 😀, thank you for remembering my first issue.

@glemaitre
Member

> A sudden announcement 😀, thank you for remembering my first issue.

Indeed, we can get the right behaviour just by changing the default. So let's do that.

@glemaitre
Member

glemaitre commented Jul 9, 2023

Looking at this in more detail, I think that I forgot to change the default of `bootstrap` because it is handled by the sampler.

Note that beforehand, we were passing the minority class entirely to `_parallel_bootstrap`, which later takes a bootstrap as well. So we were indeed taking a bootstrap sample of the minority class. What changes, then, is the way the majority class is handled.
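If I read that comment right, the combined effect can be illustrated like this (a sketch with my own variable names, not the library's internals):

```python
import numpy as np

rng = np.random.default_rng(42)
minority = np.arange(5)       # hypothetical minority sample indices
majority = np.arange(5, 50)   # hypothetical majority sample indices

# Sampler step: the minority class is passed entirely; the majority
# class is undersampled to match its size.
balanced = np.concatenate(
    [minority, rng.choice(majority, size=len(minority), replace=False)]
)

# Tree-building step: a bootstrap is then taken over the balanced set,
# so the minority side ends up bootstrapped after all; only the
# majority side differs from the paper (undersampled without
# replacement first, then bootstrapped).
boot = rng.choice(balanced, size=len(balanced), replace=True)
```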
