[WIP] Feature/linfa ensemble random forest #43
Conversation
will look into this tomorrow 👍
I mostly added some suggestions for places where you can use iterators instead of `for` loops. You should also make the algorithm generic over `Float` and labels.
In the end it would make sense to have a general `Bagging` implementation here, and to add a more specialized version called `RandomForest` in `linfa-trees`.
Thanks for the work 👍
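A rough sketch of that proposed split, with hypothetical trait and type names rather than linfa's actual API: a generic `Bagging` bootstrap-samples rows and fits any model implementing some `Fit` trait, which a `RandomForest` in `linfa-trees` could then specialize with decision trees. Assuming the `rand` crate:

```rust
use rand::Rng;

// Hypothetical trait standing in for "any fittable algorithm".
trait Fit {
    type Model;
    fn fit(&self, x: &[Vec<f64>], y: &[usize]) -> Self::Model;
}

// Generic bagging: fit one base model per bootstrap sample of the rows.
struct Bagging<P> {
    base: P,
    n_estimators: usize,
}

impl<P: Fit> Bagging<P> {
    fn fit(&self, x: &[Vec<f64>], y: &[usize]) -> Vec<P::Model> {
        let mut rng = rand::thread_rng();
        (0..self.n_estimators)
            .map(|_| {
                // Sample row indices with replacement (the bootstrap).
                let rows: Vec<usize> =
                    (0..x.len()).map(|_| rng.gen_range(0..x.len())).collect();
                let xs: Vec<Vec<f64>> = rows.iter().map(|&i| x[i].clone()).collect();
                let ys: Vec<usize> = rows.iter().map(|&i| y[i]).collect();
                self.base.fit(&xs, &ys)
            })
            .collect()
    }
}
```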
Wonderful comments. Will start fixing tomorrow ;)
added some remarks on the feature importance stuff :)
@bytesnake can we merge and close this?
mh for me a linfa-ensemble with just random forest is a bit too empty, I think there are two ways forward:
- implement ensemble learning for generic fittable algorithms, this depends on #45 and will take some time
- or move the RandomForest implementation to linfa-trees and accept this PR
I am implementing VotingClassifier in the same crate.
The whole family of XGB and extraRF models could be under the same crate.
I think putting all that under linfa-trees can be misleading.
well if they all are using decision trees, then I would put them under linfa-trees 😅
okay, if it's important for you I will accept, but at least improve the documentation a bit. First, it is required for every subcrate to add a README.md describing its purpose and goals; the link in the super crate's README.md is also missing. Also add some comments to the `RandomForest` struct, you can follow these guidelines: https://doc.rust-lang.org/stable/rust-by-example/meta/doc.html
By looking at the list of ensemble models, indeed most of them are tree-based. Hence it makes sense to refactor them there. I only found MetaClassifier and VotingClassifier not to be necessarily tree-based (they are just ensembles of generic weak learners).
Let me add the VotingClassifier and the documentation to the same PR and we move from there. Does this sound reasonable? :)
yep sounds good 👍 I have zero experience in the field of ensemble learning. Are you saying that most of them are tree-based because of some properties inherent to decision trees that are necessary for the ensemble learning process? Or that the most popular ones just normally happen to be tree-based?
Because they use trees as the single model to ensemble over.
That said, there are exceptions. For instance, a voting classifier can be composed of heterogeneous models (not necessarily trees). The same goes for a stacking classifier (which can also be based on logistic regression) or a bagging classifier (which can be based on k-nearest neighbors).
These models are still considered ensemble methods (just not tree-based).
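To make the heterogeneous case concrete, here is a minimal hard-voting sketch (hypothetical function, not linfa's API): each member model, whatever its kind, only has to produce labels, and the ensemble takes the per-sample majority vote:

```rust
use std::collections::HashMap;

// Hard majority vote: `predictions[m][i]` is model m's label for sample i.
fn majority_vote(predictions: &[Vec<usize>]) -> Vec<usize> {
    let n_samples = predictions[0].len();
    (0..n_samples)
        .map(|i| {
            let mut counts: HashMap<usize, usize> = HashMap::new();
            for model in predictions {
                *counts.entry(model[i]).or_insert(0) += 1;
            }
            // The label with the highest count wins.
            counts.into_iter().max_by_key(|&(_, c)| c).unwrap().0
        })
        .collect()
}

fn main() {
    // Three heterogeneous models (e.g. a tree, logistic regression, k-NN)
    // voting over four samples.
    let preds = vec![vec![0, 1, 1, 0], vec![0, 1, 0, 0], vec![1, 1, 1, 0]];
    assert_eq!(majority_vote(&preds), vec![0, 1, 1, 0]);
}
```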
Feature/voting classifier
I've left a few comments; I haven't taken a very thorough look at this yet.
Unfortunately I think this PR brings up some design decisions that we need to make (e.g. the `Predictor` / `Classifier` trait), which IMO need some discussion, and there is some design work to do. I also think we need some more thorough test cases. Things like `unimplemented!` should not be merged.
Another thought: a few things seem to be WIP in this PR (TODO comments, `unimplemented!` macros, etc.); maybe it makes sense to split this into multiple PRs (e.g. one for random forest, one for voting classifier, etc.) and then polish each one?
In linfa-predictor/Cargo.toml:

@@ -0,0 +1,9 @@
[package]
I think we should have a central place in the library for cross-cutting traits such as this one (`Float`, and probably a few others come to mind as well). I don't think it makes sense to have a different sub-crate for each of these traits. What do you think @bytesnake?
Yep working on it here #45 👍
I have implemented a `LinfaError` struct that is now returned from functions returning `Result`.
Specifically, decision trees should not implement `predict_probabilities()`. In the next commit I gently raise the error instead of the `unimplemented!()`.
The `Predictor` trait is currently in its own crate. I personally do not think that makes sense (it should be part of linfa itself, regardless). I have placed it in an importable crate for convenience for now, but it needs to be moved.
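A minimal sketch of that error path, with a hypothetical error type (the actual `LinfaError` in the PR may differ): the unsupported operation returns an `Err` instead of panicking.

```rust
use std::fmt;

// Hypothetical stand-in for the PR's LinfaError.
#[derive(Debug)]
enum LinfaError {
    UnsupportedOperation(&'static str),
}

impl fmt::Display for LinfaError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            LinfaError::UnsupportedOperation(op) => {
                write!(f, "operation not supported: {}", op)
            }
        }
    }
}

struct DecisionTree;

impl DecisionTree {
    // Raise the error gently instead of panicking via unimplemented!().
    fn predict_probabilities(&self) -> Result<Vec<f64>, LinfaError> {
        Err(LinfaError::UnsupportedOperation("predict_probabilities"))
    }
}

fn main() {
    match DecisionTree.predict_probabilities() {
        Ok(p) => println!("{:?}", p),
        Err(e) => println!("{}", e),
    }
}
```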
> Specifically, decision trees should not implement `predict_probabilities()`

Do you mean that
- this is actually genuinely not implemented and we should handle the case more gracefully, or
- making a prediction of class probabilities is not a sensible operation for decision trees?
In linfa-predictor/src/lib.rs:

use ndarray::{Array1, ArrayBase, Data, Ix2};

/// Trait every predictor should implement
pub trait Predictor {
I don't think `Predictor` is a good name, it should probably be `Classifier` since this is predicting classes (as opposed to a `Regressor`, which could predict continuous values).
I think that `Predictor` is an accurate name, but it should have a target type parameter, specifying whether it predicts labels or continuous values. https://github.com/rust-ml/linfa/pull/45/files#diff-44e8606db87b39b048179b1f7af8d244R19
the implementation can then be specialized at any point, like

```rust
impl<L: Label> Predict<Array2<f32>, Vec<L>> ... {
}
```

this accepts only labels, e.g. integers, strings and booleans, and nothing more
This is interesting. Both a `Classifier` and a `Regressor` are `Predictor`s :)
How about an `enum`?
OK, I agree: if the predictor is generic in its output, then the name `Predictor` makes sense.
But I think predicting probabilities doesn't make sense for all predictors, so that should be another trait.
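One way to express that separation, sketched with hypothetical trait names (nothing here is linfa's settled design): keep probability prediction in its own trait so only models where it is meaningful opt in.

```rust
// Every model predicts labels...
trait Predictor {
    fn predict(&self, x: &[f64]) -> Vec<usize>;
}

// ...but only some can meaningfully predict class probabilities.
trait ProbabilisticPredictor: Predictor {
    fn predict_probabilities(&self, x: &[f64]) -> Vec<f64>;
}

// A decision stump implements Predictor but deliberately not
// ProbabilisticPredictor; the split is enforced at compile time.
struct Stump {
    threshold: f64,
}

impl Predictor for Stump {
    fn predict(&self, x: &[f64]) -> Vec<usize> {
        x.iter().map(|v| (*v > self.threshold) as usize).collect()
    }
}

fn main() {
    let stump = Stump { threshold: 0.5 };
    println!("{:?}", stump.predict(&[0.2, 0.9]));
}
```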
linfa-predictor/src/lib.rs (outdated)
/// Trait every predictor should implement
pub trait Predictor {
    /// predict class for each sample
    fn predict(&self, x: &ArrayBase<impl Data<Elem = f64>, Ix2>) -> Array1<u64>;
logistic regression uses `predict_classes` to distinguish from `predict_probabilities`. I'm also OK with `predict` and `predict_probabilities`, but we should choose a consistent approach.
same as above, but I'm now wondering whether you can implement a trait twice, each time with a different return value? 🤔 Would be nice to have

```rust
impl<F: Float, D: Data> Predictor<D, Array1<F>> for .. {
}
impl<L: Label, D: Data> Predictor<D, Array1<L>> for .. {
}
```

for the same type, and then have the required return type decide which implementation is used. But let's move that to the PR :D
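For what it's worth, Rust does allow this when the trait is generic over its output type: the same type can implement the trait once per type parameter, and the expected return type selects the implementation. A self-contained sketch with hypothetical names (not the actual `Predictor` discussed here):

```rust
// A trait generic over its output type.
trait Predictor<T> {
    fn predict(&self, x: &[f64]) -> Vec<T>;
}

struct Model;

// Continuous output, e.g. regression.
impl Predictor<f64> for Model {
    fn predict(&self, x: &[f64]) -> Vec<f64> {
        x.iter().map(|v| v * 0.5).collect()
    }
}

// Discrete output, e.g. binary labels.
impl Predictor<bool> for Model {
    fn predict(&self, x: &[f64]) -> Vec<bool> {
        x.iter().map(|v| *v > 0.0).collect()
    }
}

fn main() {
    let model = Model;
    // The annotated return type decides which impl is used.
    let continuous: Vec<f64> = model.predict(&[1.0, -2.0]);
    let labels: Vec<bool> = model.predict(&[1.0, -2.0]);
    println!("{:?} {:?}", continuous, labels);
}
```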
One simply cannot predict probabilities with a decision tree (and should not).
A decision tree is very unstable and can change dramatically under small perturbations of the input data.
If you want something that looks like probabilities, you need to reduce the depth of the tree. And still, all observations in the same node will have the same probability.
This is because a decision tree generates probabilities as the number of samples of a class k in a leaf node over the total number of samples in that leaf.
When the tree changes so dramatically under small input perturbations, all those "probabilities" change completely.
Only in ensemble methods (and random forests in particular) does it make sense to predict probabilities.
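To make the leaf-fraction point concrete (a minimal sketch, hypothetical names): the "probability" a tree reports for class k is just the share of the leaf's training samples with label k, identical for every sample routed to that leaf.

```rust
use std::collections::HashMap;

// P(class k | leaf) = (# samples of class k in leaf) / (# samples in leaf).
fn leaf_probabilities(labels_in_leaf: &[usize]) -> HashMap<usize, f64> {
    let mut counts: HashMap<usize, usize> = HashMap::new();
    for &label in labels_in_leaf {
        *counts.entry(label).or_insert(0) += 1;
    }
    let total = labels_in_leaf.len() as f64;
    counts
        .into_iter()
        .map(|(label, count)| (label, count as f64 / total))
        .collect()
}

fn main() {
    // A leaf holding 3 samples of class 0 and 1 of class 1 yields
    // P(0) = 0.75, P(1) = 0.25 for *every* sample reaching this leaf,
    // and any perturbation that changes the splits changes these numbers.
    println!("{:?}", leaf_probabilities(&[0, 0, 0, 1]));
}
```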
I won't weigh in on `linfa-predictor` further than to say I think it's a good idea in principle and it might take a few iterations to get to a point where we are happy with it.
The voting classifier and random forest look close, but need a bit of polish in some places. The biggest omission I can see is that it doesn't look like we subset the features, which is a pretty core part of random forests.
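For reference, a hedged sketch of what the missing feature subsetting could look like (hypothetical names, assuming the `rand` crate): each tree draws a random subset of column indices, commonly sqrt(n_features) for classification, and trains only on those columns.

```rust
use rand::seq::SliceRandom;
use rand::thread_rng;

// Draw `max_features` distinct column indices at random for one tree.
fn sample_feature_indices(n_features: usize, max_features: usize) -> Vec<usize> {
    let mut indices: Vec<usize> = (0..n_features).collect();
    indices.shuffle(&mut thread_rng());
    indices.truncate(max_features);
    indices
}

fn main() {
    let n_features = 16;
    let max_features = (n_features as f64).sqrt() as usize; // 4
    let subset = sample_feature_indices(n_features, max_features);
    println!("train this tree on feature columns {:?}", subset);
}
```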
RandomForestParamsBuilder {
    tree_hyperparameters,
    n_estimators,
    max_features: Some(MaxFeatures::Auto),
If we initialise these to a (default) value, then I think we don't need the `Option`. At the moment we sort of have two default values: one here, and one in `build` when it does the `unwrap_or`.
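A sketch of that suggestion with a hypothetical, reduced field set (the real builder has more parameters): give the builder concrete defaults up front, so neither the `Option` nor the second default in `build`/`unwrap_or` is needed.

```rust
#[derive(Clone, Copy, Debug)]
enum MaxFeatures {
    Auto,
    Sqrt,
}

struct RandomForestParamsBuilder {
    n_estimators: usize,
    max_features: MaxFeatures,
}

impl RandomForestParamsBuilder {
    fn new() -> Self {
        // Single source of truth for the defaults.
        Self {
            n_estimators: 100,
            max_features: MaxFeatures::Auto,
        }
    }

    fn max_features(mut self, value: MaxFeatures) -> Self {
        self.max_features = value;
        self
    }

    fn build(self) -> (usize, MaxFeatures) {
        // No unwrap_or: every field already holds a valid value.
        (self.n_estimators, self.max_features)
    }
}

fn main() {
    let (n, mf) = RandomForestParamsBuilder::new()
        .max_features(MaxFeatures::Sqrt)
        .build();
    println!("{} trees, max_features = {:?}", n, mf);
}
```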
I'm not saying using a decision tree is the best option to predict probabilities, but it's certainly valid. You could make the same argument about decision trees being unstable for regression, right? In some cases you have dense enough datasets where these issues aren't as big of a deal. Alternatively, you can construct a shallow tree that tends to have more samples in the leaf nodes. Think about models used for inference rather than just prediction: knowing the distribution of the classes in each leaf node can be very useful.
@mossbanay commented on linfa-ensemble/README.md:

> The current benchmark performs random forest with 10, 100, 500, 1000 independent trees to predict labels and compare such results to ground truth.

Is it just `cargo bench` to run them?

yes
Hey, how's this going? Any luck with the sub-sampling of features? Let me know if you need a hand, I'd be happy to help. This is pretty close to getting over the line and into the crate 😄
this PR seems to be stale, let's continue the work in #60
Implemented random forest baseline with tests and bench