
[WIP] Feature/linfa ensemble random forest #43

Closed

Conversation

fgadaleta (Contributor)

Implemented random forest baseline with tests and bench

bytesnake (Member)

will look into this tomorrow 👍

bytesnake (Member) left a review:

I mostly added some suggestions for places where you can use iterators instead of for loops. You should also make the algorithm generic over Float and labels.
In the end it would make sense to have a general Bagging implementation here, and a more specialized version, called RandomForest, in linfa-trees.
Thanks for the work 👍
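A minimal sketch of the kind of aggregation such a generic Bagging type would need, assuming a majority-vote combiner; `majority_vote` and its trait bounds are illustrative, not the API being proposed:

```rust
use std::collections::HashMap;
use std::hash::Hash;

/// Combine per-model predictions by majority vote.
/// `votes[m][i]` is model m's prediction for sample i; assumes at least
/// one model and equal-length prediction vectors.
fn majority_vote<L: Eq + Hash + Clone>(votes: &[Vec<L>]) -> Vec<L> {
    let n_samples = votes[0].len();
    (0..n_samples)
        .map(|i| {
            let mut counts: HashMap<&L, usize> = HashMap::new();
            for model_votes in votes {
                *counts.entry(&model_votes[i]).or_insert(0) += 1;
            }
            counts
                .into_iter()
                .max_by_key(|&(_, count)| count)
                .map(|(label, _)| label.clone())
                .unwrap()
        })
        .collect()
}
```

Being generic over the label type `L` (rather than hard-coding `u64`) is what makes the same combiner reusable for integer, string, or boolean labels.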

Review threads on linfa-ensemble/benches/random_forest.rs and linfa-ensemble/src/random_forest/algorithm.rs (resolved).
fgadaleta (Contributor, Author)

> I mostly added some suggestions for places where you can use iterators instead of for loops. You should also make the algorithm generic over Float and labels.
> In the end it would make sense to have a general Bagging implementation here, and a more specialized version, called RandomForest, in linfa-trees.
> Thanks for the work 👍

Wonderful comments. Will start fixing tomorrow ;)

bytesnake (Member) left a review:

Added some remarks on the feature importance stuff :)

Review threads on linfa-trees/src/decision_trees/algorithm.rs and linfa-ensemble/src/random_forest/algorithm.rs (resolved).
fgadaleta (Contributor, Author) commented Sep 16, 2020

@bytesnake can we merge and close this?

bytesnake (Member)

Hmm, for me a linfa-ensemble with just random forest is a bit too empty. I think there are two ways forward:

fgadaleta (Contributor, Author) commented Sep 16, 2020 via email

bytesnake (Member)

Well, if they are all using decision trees, then I would put them under linfa-trees 😅

Okay, if it's important to you I will accept, but at least improve the documentation a bit. First, every subcrate is required to have a README.md describing its purpose and goals, and the link to this subcrate is also missing from the top-level README.md. Also add some doc comments to the RandomForest struct; you can follow https://doc.rust-lang.org/stable/rust-by-example/meta/doc.html
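For reference, a small sketch of the kind of doc comments being requested, in the style of the linked page; the struct body is omitted and the wording is illustrative:

```rust
/// A random forest: an ensemble of decision trees trained on bootstrap
/// samples of the data, predicting by majority vote over the trees.
///
/// See the crate-level README for the purpose and goals of linfa-ensemble.
pub struct RandomForest {
    // fields omitted in this sketch
}
```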

fgadaleta (Contributor, Author) commented Sep 16, 2020 via email

bytesnake (Member) commented Sep 16, 2020

Yep, sounds good 👍 I have zero experience in the field of ensemble learning. Are you saying that most of them are tree-based because of some properties inherent to decision trees that are necessary for the ensemble learning process, or that the usual ones, e.g. the most popular, are tree-based?

fgadaleta (Contributor, Author) commented Sep 16, 2020 via email

paulkoerbitz (Collaborator) left a review:

I've left a few comments, but I haven't taken a very thorough look at this yet.

Unfortunately I think this PR brings up some design decisions that we need to make (e.g. the Predictor / Classifier trait), which IMO need some discussion, and there is some design work to do. I also think we need some more thorough test cases. Things like unimplemented! should not be merged.

Another thought: a few things seem to be WIP in this PR (TODO comments, unimplemented! macros, etc.), so maybe it makes sense to split this into multiple PRs (e.g. one for random forest, one for the voting classifier, etc.) and then polish each one?

Inline comment on a new Cargo.toml ([package] section):
paulkoerbitz (Collaborator):
I think we should have a generic place in the library for cross-cutting traits such as this one (Float, and probably a few others that come to mind as well). I don't think it makes sense to have a different sub-crate for each of these traits. What do you think @bytesnake?

bytesnake (Member):
Yep, working on it here: #45 👍

fgadaleta (Contributor, Author):
I have implemented a LinfaError struct that is now returned from functions returning Result. Specifically, decision trees should not implement predict_probabilities(); in the next commit I gently raise the error instead of calling unimplemented!().

The Predictor trait is currently in its own crate. I personally do not think that makes sense (it should be part of linfa, regardless). I have placed it in an importable crate for convenience for now, but it needs to be moved.
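A sketch of the described change, assuming a simple shape for the mentioned LinfaError struct (its actual fields are not shown in this thread):

```rust
use std::fmt;

/// Assumed shape of the error type mentioned above.
#[derive(Debug)]
pub struct LinfaError {
    pub message: String,
}

impl fmt::Display for LinfaError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "{}", self.message)
    }
}

impl std::error::Error for LinfaError {}

/// Instead of panicking with `unimplemented!()`, surface the limitation:
pub fn predict_probabilities() -> Result<Vec<f64>, LinfaError> {
    Err(LinfaError {
        message: "predict_probabilities is not supported for decision trees".into(),
    })
}
```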

mossbanay (Contributor) commented Oct 8, 2020:

> Specifically, decision trees should not implement predict_probabilities()

Do you mean that:

  1. this is actually genuinely not implemented and we should handle the case more gracefully, or
  2. making a prediction of class probabilities is not a sensible operation for decision trees?

```rust
use ndarray::{Array1, ArrayBase, Data, Ix2};

/// Trait every predictor should implement
pub trait Predictor {
```
paulkoerbitz (Collaborator):

I don't think Predictor is a good name; it should probably be Classifier, since this is predicting classes (as opposed to a Regressor, which could predict continuous values).

bytesnake (Member):

I think that Predictor is an accurate name, but it should have a Target type parameter specifying whether it predicts labels or continuous values: https://github.com/rust-ml/linfa/pull/45/files#diff-44e8606db87b39b048179b1f7af8d244R19

bytesnake (Member):

The implementation can then be specialized at any point, like:

```rust
impl<L: Label> Predict<Array2<f32>, Vec<L>> ... {
}
```

This accepts only labels, e.g. integers, strings, and booleans, and nothing more.
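A self-contained sketch of that two-parameter design; the `Label` bound and the `MajorityModel` type are invented here purely for illustration:

```rust
use ndarray::Array2;
use std::hash::Hash;

pub trait Label: Eq + Hash + Clone {}
impl Label for usize {}
impl Label for bool {}
impl Label for String {}

/// Predict a target `T` from records `R`; one model type can implement
/// this trait for several target types.
pub trait Predict<R, T> {
    fn predict(&self, records: R) -> T;
}

/// Toy model that always predicts a stored label.
struct MajorityModel<L: Label> {
    majority: L,
}

// This impl accepts only label targets (integers, strings, booleans, ...):
impl<'a, L: Label> Predict<&'a Array2<f32>, Vec<L>> for MajorityModel<L> {
    fn predict(&self, records: &'a Array2<f32>) -> Vec<L> {
        vec![self.majority.clone(); records.nrows()]
    }
}
```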

fgadaleta (Contributor, Author):

This is interesting. Both a Classifier and Regressor are Predictors :)
How about an enum?

paulkoerbitz (Collaborator):

OK, I agree: if the predictor is generic in its output then the name Predictor makes sense.

But I think predicting probabilities doesn't make sense for all predictors, so it should be another trait.
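A sketch of that split, with illustrative names: the base trait stays generic in its output, while probability prediction lives in a separate trait that only some models implement:

```rust
use ndarray::{Array1, Array2};

/// Generic prediction: the output type says whether this is
/// classification or regression.
pub trait Predictor<T> {
    fn predict(&self, x: &Array2<f64>) -> Array1<T>;
}

/// Implemented only by models for which class probabilities are
/// meaningful (e.g. ensembles), not by every Predictor.
pub trait PredictProbabilities {
    /// Returns one row of class probabilities per sample.
    fn predict_probabilities(&self, x: &Array2<f64>) -> Array2<f64>;
}
```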

Resolved review thread on linfa-predictor/src/lib.rs.
```rust
/// Trait every predictor should implement
pub trait Predictor {
    /// predict class for each sample
    fn predict(&self, x: &ArrayBase<impl Data<Elem = f64>, Ix2>) -> Array1<u64>;
```
paulkoerbitz (Collaborator):

Logistic regression uses predict_classes to distinguish it from predict_probabilities. I'm also OK with predict and predict_probabilities, but we should choose a consistent approach.

bytesnake (Member):

Same as above, but I'm now wondering whether you can implement a trait twice, each time with different return values 🤔 It would be nice to have

```rust
impl<F: Float, D: Data> Predictor<D, Array1<F>> for .. {
}
impl<L: Label, D: Data> Predictor<D, Array1<L>> for .. {
}
```

for the same type, and then have the required return type decide which implementation is used. But let's move that to the PR :D
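To the question above: Rust does allow this, because `Predictor<Array1<f64>>` and `Predictor<Array1<usize>>` are distinct traits once the trait is generic over its output; the caller's expected return type selects the implementation. A sketch with invented names:

```rust
use ndarray::{Array1, Array2};

pub trait Predictor<Out> {
    fn predict(&self, x: &Array2<f64>) -> Out;
}

struct ConstantModel;

impl Predictor<Array1<f64>> for ConstantModel {
    fn predict(&self, x: &Array2<f64>) -> Array1<f64> {
        Array1::zeros(x.nrows()) // continuous output
    }
}

impl Predictor<Array1<usize>> for ConstantModel {
    fn predict(&self, x: &Array2<f64>) -> Array1<usize> {
        Array1::zeros(x.nrows()) // label output
    }
}

// usage: the type annotation picks the impl
// let labels: Array1<usize> = model.predict(&x);
```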

Review threads on linfa-predictor/src/lib.rs and linfa-trees/src/decision_trees/algorithm.rs (resolved).
paulkoerbitz changed the title from "Feature/linfa ensemble random forest" to "[WIP] Feature/linfa ensemble random forest" on Sep 20, 2020.
fgadaleta (Contributor, Author) commented Oct 8, 2020 via email

mossbanay (Contributor) left a review:

I won't weigh in on linfa-predictor further, other than to say I think it's a good idea in principle and it might take a few iterations to get to a point where we are happy with it.
The voting classifier and random forest look close, but they need a bit of polish in some places. The biggest omission I can see is that we don't appear to subset the features, which is a pretty core part of random forests.
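For reference, a minimal sketch of per-tree feature sub-sampling (classic random forests actually re-sample features at every split; per-tree sampling is shown here for brevity, and the `rand` usage is illustrative):

```rust
use ndarray::{Array2, Axis};
use rand::seq::index::sample;
use rand::thread_rng;

/// Pick `max_features` distinct column indices for one tree.
fn sample_feature_indices(n_features: usize, max_features: usize) -> Vec<usize> {
    let mut rng = thread_rng();
    sample(&mut rng, n_features, max_features.min(n_features)).into_vec()
}

/// Restrict a dataset to the sampled columns before fitting one tree.
fn subset_features(x: &Array2<f64>, indices: &[usize]) -> Array2<f64> {
    x.select(Axis(1), indices)
}
```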

Review threads on linfa-ensemble/examples/random_forest.rs, linfa-ensemble/README.md, linfa-ensemble/benches/random_forest.rs, and linfa-ensemble/src/random_forest/algorithm.rs (resolved).
```rust
RandomForestParamsBuilder {
    tree_hyperparameters,
    n_estimators,
    max_features: Some(MaxFeatures::Auto),
```
mossbanay (Contributor):

If we initialise these to a default value, then I think we don't need the Option. At the moment we effectively have two default values: one here and one in build, where it does the unwrap_or.
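A sketch of that simplification, reusing the field names from the snippet above (the rest of the builder, including tree_hyperparameters, is assumed and omitted):

```rust
pub enum MaxFeatures {
    Auto,
    Count(usize),
}

pub struct RandomForestParamsBuilder {
    n_estimators: usize,
    max_features: MaxFeatures, // no Option: the default is set once, up front
}

impl RandomForestParamsBuilder {
    pub fn new(n_estimators: usize) -> Self {
        RandomForestParamsBuilder {
            n_estimators,
            max_features: MaxFeatures::Auto, // single source of the default
        }
    }

    /// Override the default.
    pub fn max_features(mut self, max_features: MaxFeatures) -> Self {
        self.max_features = max_features;
        self
    }
}
```

With this shape, build can consume the fields directly, with no unwrap_or and no second copy of the default.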

Review threads on linfa-ensemble/src/voting_classifier/algorithm.rs, linfa-ensemble/src/voting_classifier/hyperparameters.rs, and linfa-trees/src/decision_trees/algorithm.rs (resolved).
mossbanay (Contributor)

> One simply cannot predict probabilities with a decision tree (and should not). A decision tree is very unstable and can change dramatically for small perturbations of the input data. If you want something that looks like probabilities you need to reduce the depth of the tree, and even then all observations in the same node will have the same probability. This is because a decision tree generates probabilities from the number of samples of a class k in a leaf node over the number of samples in that leaf. When the tree changes so dramatically from small input data perturbations, all those "probabilities" change completely. Only in ensemble methods (and random forests in particular) does it make sense to predict probabilities.

I'm not saying a decision tree is the best option for predicting probabilities, but it's certainly valid. You could make the same argument about decision trees being unstable for regression, right?

In some cases you have dense enough datasets where these issues aren't as big a deal. Alternatively, you can construct a shallow tree that tends to have more samples in the leaf nodes. Think about models used for inference rather than just prediction: knowing the distribution of the classes in each leaf node can be very useful.
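The mechanism both sides refer to, as a minimal sketch: a leaf's "probability" for class k is simply the fraction of that leaf's training samples with label k, which is why it shifts whenever the tree structure shifts:

```rust
use std::collections::HashMap;

/// Class probabilities for one leaf: count(k) / samples_in_leaf.
fn leaf_probabilities(leaf_labels: &[usize]) -> HashMap<usize, f64> {
    let n = leaf_labels.len() as f64;
    let mut counts: HashMap<usize, usize> = HashMap::new();
    for &label in leaf_labels {
        *counts.entry(label).or_insert(0) += 1;
    }
    counts
        .into_iter()
        .map(|(k, c)| (k, c as f64 / n))
        .collect()
}
```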

fgadaleta (Contributor, Author) commented Oct 10, 2020 via email

mossbanay (Contributor)

Hey, how's this going? Any luck with the sub-sampling of features? Let me know if you need a hand, I'd be happy to help. This is pretty close to getting over the line and into the crate 😄

bytesnake mentioned this pull request on Nov 20, 2020.
bytesnake (Member)

This PR seems to be stale; work continues in #60.
