
[WIP] Feature/linfa ensemble random forest #43

Closed

Conversation

fgadaleta (Contributor)

Implemented random forest baseline with tests and bench

bytesnake (Member)

will look into this tomorrow 👍

bytesnake (Member) left a review:

I mostly added some suggestions for places where you can use iterators instead of for loops. You should also make the algorithm generic over Float and labels.
In the end it would make sense to have a general Bagging implementation here, and a more specialized version, called RandomForest, in linfa-trees.
Thanks for the work 👍
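A minimal sketch of the kind of aggregation such a generic Bagging type would need, assuming a majority-vote combiner; `majority_vote` and its trait bounds are illustrative, not the API being proposed:

```rust
use std::collections::HashMap;
use std::hash::Hash;

/// Combine per-model predictions by majority vote.
/// `votes[m][i]` is model m's prediction for sample i; assumes at least
/// one model and equal-length prediction vectors.
fn majority_vote<L: Eq + Hash + Clone>(votes: &[Vec<L>]) -> Vec<L> {
    let n_samples = votes[0].len();
    (0..n_samples)
        .map(|i| {
            let mut counts: HashMap<&L, usize> = HashMap::new();
            for model_votes in votes {
                *counts.entry(&model_votes[i]).or_insert(0) += 1;
            }
            counts
                .into_iter()
                .max_by_key(|&(_, count)| count)
                .map(|(label, _)| label.clone())
                .unwrap()
        })
        .collect()
}
```

Being generic over the label type `L` (rather than hard-coding `u64`) is what makes the same combiner reusable for integer, string, or boolean labels.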

Review threads on linfa-ensemble/benches/random_forest.rs and linfa-ensemble/src/random_forest/algorithm.rs (resolved).
fgadaleta (Contributor, Author)

> I mostly added some suggestions for places where you can use iterators instead of for loops. You should also make the algorithm generic over Float and labels.
> In the end it would make sense to have a general Bagging implementation here, and a more specialized version, called RandomForest, in linfa-trees.
> Thanks for the work 👍

Wonderful comments. Will start fixing tomorrow ;)

bytesnake (Member) left a review:

Added some remarks on the feature importance stuff :)

Review threads on linfa-trees/src/decision_trees/algorithm.rs and linfa-ensemble/src/random_forest/algorithm.rs (resolved).
fgadaleta (Contributor, Author) commented Sep 16, 2020

@bytesnake can we merge and close this?

bytesnake (Member)

Hmm, for me a linfa-ensemble with just random forest is a bit too empty. I think there are two ways forward:

fgadaleta (Contributor, Author) commented Sep 16, 2020 via email

bytesnake (Member)

Well, if they are all using decision trees, then I would put them under linfa-trees 😅

Okay, if it's important to you I will accept, but at least improve the documentation a bit. First, every subcrate is required to have a README.md describing its purpose and goals, and the link to this subcrate is also missing from the top-level README.md. Also add some doc comments to the RandomForest struct; you can follow https://doc.rust-lang.org/stable/rust-by-example/meta/doc.html
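For reference, a small sketch of the kind of doc comments being requested, in the style of the linked page; the struct body is omitted and the wording is illustrative:

```rust
/// A random forest: an ensemble of decision trees trained on bootstrap
/// samples of the data, predicting by majority vote over the trees.
///
/// See the crate-level README for the purpose and goals of linfa-ensemble.
pub struct RandomForest {
    // fields omitted in this sketch
}
```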

fgadaleta (Contributor, Author) commented Sep 16, 2020 via email

bytesnake (Member) commented Sep 16, 2020

Yep, sounds good 👍 I have zero experience in the field of ensemble learning. Are you saying that most of them are tree-based because of some properties inherent to decision trees that are necessary for the ensemble learning process, or that the usual ones, e.g. the most popular, are tree-based?

fgadaleta (Contributor, Author) commented Sep 16, 2020 via email

paulkoerbitz (Collaborator) left a review:

I've left a few comments, but I haven't taken a very thorough look at this yet.

Unfortunately I think this PR brings up some design decisions that we need to make (e.g. the Predictor / Classifier trait), which IMO need some discussion, and there is some design work to do. I also think we need some more thorough test cases. Things like unimplemented! should not be merged.

Another thought: a few things seem to be WIP in this PR (TODO comments, unimplemented! macros, etc.), so maybe it makes sense to split this into multiple PRs (e.g. one for random forest, one for the voting classifier, etc.) and then polish each one?

Inline comment on a new Cargo.toml ([package] section):
paulkoerbitz (Collaborator):
I think we should have a generic place in the library for cross-cutting traits such as this one (Float, and probably a few others that come to mind as well). I don't think it makes sense to have a different sub-crate for each of these traits. What do you think @bytesnake?

bytesnake (Member):
Yep, working on it here: #45 👍

fgadaleta (Contributor, Author):
I have implemented a LinfaError struct that is now returned from functions returning Result. Specifically, decision trees should not implement predict_probabilities(); in the next commit I gently raise the error instead of calling unimplemented!().

The Predictor trait is currently in its own crate. I personally do not think that makes sense (it should be part of linfa, regardless). I have placed it in an importable crate for convenience for now, but it needs to be moved.
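A sketch of the described change, assuming a simple shape for the mentioned LinfaError struct (its actual fields are not shown in this thread):

```rust
use std::fmt;

/// Assumed shape of the error type mentioned above.
#[derive(Debug)]
pub struct LinfaError {
    pub message: String,
}

impl fmt::Display for LinfaError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "{}", self.message)
    }
}

impl std::error::Error for LinfaError {}

/// Instead of panicking with `unimplemented!()`, surface the limitation:
pub fn predict_probabilities() -> Result<Vec<f64>, LinfaError> {
    Err(LinfaError {
        message: "predict_probabilities is not supported for decision trees".into(),
    })
}
```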

mossbanay (Contributor) commented Oct 8, 2020:

> Specifically, decision trees should not implement predict_probabilities()

Do you mean that:

  1. this is actually genuinely not implemented and we should handle the case more gracefully, or
  2. making a prediction of class probabilities is not a sensible operation for decision trees?

```rust
use ndarray::{Array1, ArrayBase, Data, Ix2};

/// Trait every predictor should implement
pub trait Predictor {
```
paulkoerbitz (Collaborator):

I don't think Predictor is a good name; it should probably be Classifier, since this is predicting classes (as opposed to a Regressor, which could predict continuous values).

bytesnake (Member):

I think that Predictor is an accurate name, but it should have a Target type parameter specifying whether it predicts labels or continuous values: https://github.com/rust-ml/linfa/pull/45/files#diff-44e8606db87b39b048179b1f7af8d244R19

bytesnake (Member):

The implementation can then be specialized at any point, like:

```rust
impl<L: Label> Predict<Array2<f32>, Vec<L>> ... {
}
```

This accepts only labels, e.g. integers, strings, and booleans, and nothing more.
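A self-contained sketch of that two-parameter design; the `Label` bound and the `MajorityModel` type are invented here purely for illustration:

```rust
use ndarray::Array2;
use std::hash::Hash;

pub trait Label: Eq + Hash + Clone {}
impl Label for usize {}
impl Label for bool {}
impl Label for String {}

/// Predict a target `T` from records `R`; one model type can implement
/// this trait for several target types.
pub trait Predict<R, T> {
    fn predict(&self, records: R) -> T;
}

/// Toy model that always predicts a stored label.
struct MajorityModel<L: Label> {
    majority: L,
}

// This impl accepts only label targets (integers, strings, booleans, ...):
impl<'a, L: Label> Predict<&'a Array2<f32>, Vec<L>> for MajorityModel<L> {
    fn predict(&self, records: &'a Array2<f32>) -> Vec<L> {
        vec![self.majority.clone(); records.nrows()]
    }
}
```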

fgadaleta (Contributor, Author):

This is interesting. Both a Classifier and Regressor are Predictors :)
How about an enum?

paulkoerbitz (Collaborator):

OK, I agree: if the predictor is generic in its output then the name Predictor makes sense.

But I think predicting probabilities doesn't make sense for all predictors, so it should be another trait.
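A sketch of that split, with illustrative names: the base trait stays generic in its output, while probability prediction lives in a separate trait that only some models implement:

```rust
use ndarray::{Array1, Array2};

/// Generic prediction: the output type says whether this is
/// classification or regression.
pub trait Predictor<T> {
    fn predict(&self, x: &Array2<f64>) -> Array1<T>;
}

/// Implemented only by models for which class probabilities are
/// meaningful (e.g. ensembles), not by every Predictor.
pub trait PredictProbabilities {
    /// Returns one row of class probabilities per sample.
    fn predict_probabilities(&self, x: &Array2<f64>) -> Array2<f64>;
}
```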

Resolved review thread on linfa-predictor/src/lib.rs.
```rust
/// Trait every predictor should implement
pub trait Predictor {
    /// predict class for each sample
    fn predict(&self, x: &ArrayBase<impl Data<Elem = f64>, Ix2>) -> Array1<u64>;
```
paulkoerbitz (Collaborator):

Logistic regression uses predict_classes to distinguish it from predict_probabilities. I'm also OK with predict and predict_probabilities, but we should choose a consistent approach.

bytesnake (Member):

Same as above, but I'm now wondering whether you can implement a trait twice, each time with different return values 🤔 It would be nice to have

```rust
impl<F: Float, D: Data> Predictor<D, Array1<F>> for .. {
}
impl<L: Label, D: Data> Predictor<D, Array1<L>> for .. {
}
```

for the same type, and then have the required return type decide which implementation is used. But let's move that to the PR :D
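To the question above: Rust does allow this, because `Predictor<Array1<f64>>` and `Predictor<Array1<usize>>` are distinct traits once the trait is generic over its output; the caller's expected return type selects the implementation. A sketch with invented names:

```rust
use ndarray::{Array1, Array2};

pub trait Predictor<Out> {
    fn predict(&self, x: &Array2<f64>) -> Out;
}

struct ConstantModel;

impl Predictor<Array1<f64>> for ConstantModel {
    fn predict(&self, x: &Array2<f64>) -> Array1<f64> {
        Array1::zeros(x.nrows()) // continuous output
    }
}

impl Predictor<Array1<usize>> for ConstantModel {
    fn predict(&self, x: &Array2<f64>) -> Array1<usize> {
        Array1::zeros(x.nrows()) // label output
    }
}

// usage: the type annotation picks the impl
// let labels: Array1<usize> = model.predict(&x);
```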

Review threads on linfa-predictor/src/lib.rs and linfa-trees/src/decision_trees/algorithm.rs (resolved).
paulkoerbitz changed the title from "Feature/linfa ensemble random forest" to "[WIP] Feature/linfa ensemble random forest" on Sep 20, 2020.
fgadaleta (Contributor, Author) commented Oct 8, 2020 via email

mossbanay (Contributor) left a review:

I won't weigh in on linfa-predictor further, other than to say I think it's a good idea in principle and it might take a few iterations to get to a point where we are happy with it.
The voting classifier and random forest look close, but they need a bit of polish in some places. The biggest omission I can see is that we don't appear to subset the features, which is a pretty core part of random forests.
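For reference, a minimal sketch of per-tree feature sub-sampling (classic random forests actually re-sample features at every split; per-tree sampling is shown here for brevity, and the `rand` usage is illustrative):

```rust
use ndarray::{Array2, Axis};
use rand::seq::index::sample;
use rand::thread_rng;

/// Pick `max_features` distinct column indices for one tree.
fn sample_feature_indices(n_features: usize, max_features: usize) -> Vec<usize> {
    let mut rng = thread_rng();
    sample(&mut rng, n_features, max_features.min(n_features)).into_vec()
}

/// Restrict a dataset to the sampled columns before fitting one tree.
fn subset_features(x: &Array2<f64>, indices: &[usize]) -> Array2<f64> {
    x.select(Axis(1), indices)
}
```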

Review threads on linfa-ensemble/examples/random_forest.rs, linfa-ensemble/README.md, linfa-ensemble/benches/random_forest.rs, and linfa-ensemble/src/random_forest/algorithm.rs (resolved).
```rust
RandomForestParamsBuilder {
    tree_hyperparameters,
    n_estimators,
    max_features: Some(MaxFeatures::Auto),
```
mossbanay (Contributor):

If we initialise these to a default value, then I think we don't need the Option. At the moment we effectively have two default values: one here and one in build, where it does the unwrap_or.
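A sketch of that simplification, reusing the field names from the snippet above (the rest of the builder, including tree_hyperparameters, is assumed and omitted):

```rust
pub enum MaxFeatures {
    Auto,
    Count(usize),
}

pub struct RandomForestParamsBuilder {
    n_estimators: usize,
    max_features: MaxFeatures, // no Option: the default is set once, up front
}

impl RandomForestParamsBuilder {
    pub fn new(n_estimators: usize) -> Self {
        RandomForestParamsBuilder {
            n_estimators,
            max_features: MaxFeatures::Auto, // single source of the default
        }
    }

    /// Override the default.
    pub fn max_features(mut self, max_features: MaxFeatures) -> Self {
        self.max_features = max_features;
        self
    }
}
```

With this shape, build can consume the fields directly, with no unwrap_or and no second copy of the default.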

Review threads on linfa-ensemble/src/voting_classifier/algorithm.rs, linfa-ensemble/src/voting_classifier/hyperparameters.rs, and linfa-trees/src/decision_trees/algorithm.rs (resolved).
mossbanay (Contributor)

> One simply cannot predict probabilities with a decision tree (and should not). A decision tree is very unstable and can change dramatically for small perturbations of the input data. If you want something that looks like probabilities you need to reduce the depth of the tree, and even then all observations in the same node will have the same probability. This is because a decision tree generates probabilities from the number of samples of a class k in a leaf node over the number of samples in that leaf. When the tree changes so dramatically from small input data perturbations, all those "probabilities" change completely. Only in ensemble methods (and random forests in particular) does it make sense to predict probabilities.

I'm not saying a decision tree is the best option for predicting probabilities, but it's certainly valid. You could make the same argument about decision trees being unstable for regression, right?

In some cases you have dense enough datasets where these issues aren't as big a deal. Alternatively, you can construct a shallow tree that tends to have more samples in the leaf nodes. Think about models used for inference rather than just prediction: knowing the distribution of the classes in each leaf node can be very useful.
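The mechanism both sides refer to, as a minimal sketch: a leaf's "probability" for class k is simply the fraction of that leaf's training samples with label k, which is why it shifts whenever the tree structure shifts:

```rust
use std::collections::HashMap;

/// Class probabilities for one leaf: count(k) / samples_in_leaf.
fn leaf_probabilities(leaf_labels: &[usize]) -> HashMap<usize, f64> {
    let n = leaf_labels.len() as f64;
    let mut counts: HashMap<usize, usize> = HashMap::new();
    for &label in leaf_labels {
        *counts.entry(label).or_insert(0) += 1;
    }
    counts
        .into_iter()
        .map(|(k, c)| (k, c as f64 / n))
        .collect()
}
```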

fgadaleta (Contributor, Author) commented Oct 10, 2020 via email

mossbanay (Contributor)

Hey, how's this going? Any luck with the sub-sampling of features? Let me know if you need a hand, I'd be happy to help. This is pretty close to getting over the line and into the crate 😄

bytesnake mentioned this pull request on Nov 20, 2020.
bytesnake (Member)

This PR seems to be stale; work continues in #60.
