
[WIP] Feature/linfa ensemble random forest #43

Closed
Commits
36 commits
ee95910
add debug trait to structs
fgadaleta Sep 1, 2020
828963c
initial commit for random forest implementation
fgadaleta Sep 1, 2020
ffa9f85
WIP first iter of random forest
fgadaleta Sep 2, 2020
40fd48e
add benches
fgadaleta Sep 3, 2020
bc72ace
add benches
Sep 3, 2020
3ad0ee2
setup bench and cleanup
Sep 3, 2020
d705d21
cleanup
Sep 3, 2020
cbda63d
fixed comments as in PR#43
fgadaleta Sep 10, 2020
ba047af
add max_n_rows for single decision tree fitting
Sep 10, 2020
09b1314
Merge branch 'master' into feature/linfa-ensemble-random-forest
fgadaleta Sep 12, 2020
d24d2b0
implement random forest feature importance as collection of features …
fgadaleta Sep 12, 2020
2534a98
Merge branch 'feature/linfa-ensemble-random-forest' of github.com:fga…
fgadaleta Sep 12, 2020
e6fda07
implement random forest feature importance as collection of features …
fgadaleta Sep 12, 2020
7a44978
remove unused var
fgadaleta Sep 12, 2020
e88efaf
remove unused var
fgadaleta Sep 12, 2020
a442b11
run clippy
fgadaleta Sep 15, 2020
8ff236c
assert test success for feature importance
fgadaleta Sep 15, 2020
eb45a2d
clippy and fmt
fgadaleta Sep 15, 2020
9b08215
store references of nodes to queue
fgadaleta Sep 15, 2020
7d75c98
WIP voting classifier and predictor trait
fgadaleta Sep 16, 2020
cf3e99c
WIP voting classifier and predictor trait
fgadaleta Sep 16, 2020
5273ad7
Merge pull request #1 from fgadaleta/feature/voting-classifier
fgadaleta Sep 16, 2020
516949e
implement and test VotingClassifier hard voting
fgadaleta Sep 16, 2020
cf6aaf3
implement predict_proba for random forest and tested
fgadaleta Sep 17, 2020
c51b3f6
documentation, examples, cleanup
fgadaleta Sep 17, 2020
6bb44ba
cleanup
fgadaleta Sep 17, 2020
6d4b9ce
implement LinfaError for Predictor trait
fgadaleta Sep 22, 2020
8b260a8
fixed tests and CI/CD pipeline
fgadaleta Sep 22, 2020
c0d1c22
renamed predict_classes to predict in logreg for consistency
fgadaleta Sep 22, 2020
a60cd8c
implement ProbabilisticPredictor whenever needed
fgadaleta Sep 30, 2020
2dbc669
votingclassifier implements predictor trait
fgadaleta Sep 30, 2020
914e32d
Merge branch 'master' into feature/linfa-ensemble-random-forest
fgadaleta Oct 3, 2020
8c42230
PR-43 Moss comments addressed
fgadaleta Oct 9, 2020
a2843c8
Merge branch 'feature/linfa-ensemble-random-forest' of github.com:fg…
fgadaleta Oct 9, 2020
814674f
nits from PR-43 and feature importance as a vec
fgadaleta Oct 15, 2020
f5f4897
nits from PR-43 and feature importance as a vec
fgadaleta Oct 15, 2020
3 changes: 2 additions & 1 deletion Cargo.toml
@@ -16,9 +16,9 @@ categories = ["algorithms", "mathematics", "science"]
rand = "0.7"
ndarray = { version = "0.13", features = ["rayon", "serde", "approx"]}
num-traits = "0.1.32"

linfa-clustering = { path = "linfa-clustering", version = "0.1" }
linfa-trees = { path = "linfa-trees", version = "0.1" }
linfa-ensemble = { path = "linfa-ensemble", version = "0.1" }
linfa-reduction = { path = "linfa-reduction", version = "0.1" }
linfa-linear = { path = "linfa-linear", version = "0.1" }
linfa-logistic = { path = "linfa-logistic", version = "0.1" }
@@ -38,6 +38,7 @@ members = [
"linfa-linear",
"linfa-logistic",
"linfa-trees",
"linfa-ensemble",
"linfa-svm"
]

29 changes: 29 additions & 0 deletions linfa-ensemble/Cargo.toml
@@ -0,0 +1,29 @@
[package]
name = "linfa-ensemble"
version = "0.1.0"
edition = "2018"
authors = ["Francesco Gadaleta <francesco@amethix.com>"]
description = "A collection of ensemble methods based on decision trees"
license = "MIT/Apache-2.0"

repository = "https://github.com/rust-ml/linfa"
readme = "README.md"

keywords = ["machine-learning", "linfa", "ensemble", "random-forest", "supervised"]
categories = ["algorithms", "mathematics", "science"]

[dependencies]
ndarray = { version = "0.13" , features = ["rayon", "approx"]}
ndarray-rand = "0.11"
linfa-trees = {path = "../linfa-trees", version = "0.1"}

[dev-dependencies]
rand_isaac = "0.2.0"
ndarray-npy = { version = "0.5", default-features = false }
criterion = "0.3"
serde_json = "1"
approx = "0.3"

[[bench]]
name = "random_forest"
harness = false
Empty file added linfa-ensemble/README.md
50 changes: 50 additions & 0 deletions linfa-ensemble/benches/random_forest.rs
@@ -0,0 +1,50 @@
use criterion::{criterion_group, criterion_main, BenchmarkId, Criterion};
use linfa_trees::DecisionTreeParams;
use ndarray::{Array, Array1};
use linfa_ensemble::{RandomForest, RandomForestParamsBuilder, MaxFeatures};


fn random_forest_bench(c: &mut Criterion) {
// Load data
let data = vec![0.54439407, 0.26408166, 0.97446289, 0.81338034, 0.08248497,
0.30045893, 0.35535142, 0.26975284, 0.46910295, 0.72357513,
0.77458868, 0.09104661, 0.17291617, 0.50215056, 0.26381918,
0.06778572, 0.92139866, 0.30618514, 0.36123106, 0.90650849,
0.88988489, 0.44992222, 0.95507872, 0.52735043, 0.42282919,
0.98382015, 0.68076762, 0.4890352 , 0.88607302, 0.24732972,
0.98936691, 0.73508201, 0.16745694, 0.25099697, 0.32681078,
0.37070237, 0.87316842, 0.85858922, 0.55702507, 0.06624119,
0.3272859 , 0.46670468, 0.87466706, 0.51465624, 0.69996642,
0.04334688, 0.6785262 , 0.80599445, 0.6690343 , 0.29780375];

// Define parameters of single tree
let tree_params = DecisionTreeParams::new(2)
.max_depth(Some(3))
.min_samples_leaf(2 as u64)
.build();
// Define parameters of random forest
let trees_set_sizes = vec![10, 100, 500, 1000];
// Benchmark training time 10 times for each training sample size
let mut group = c.benchmark_group("random_forest");
group.sample_size(10);

for ntrees in trees_set_sizes.iter() {
let xtrain = Array::from(data.clone()).into_shape((10, 5)).unwrap();
let ytrain = Array1::from(vec![0, 1, 0, 1, 1, 0, 1, 0, 1, 1]);

let rf_params = RandomForestParamsBuilder::new(tree_params, *ntrees as usize)
.max_features(Some(MaxFeatures::Auto))
.build();
group.bench_with_input(
BenchmarkId::from_parameter(ntrees),
&(xtrain, ytrain),
|b, (x, y)| b.iter(|| RandomForest::fit(rf_params, &x, &y)),
);
}

group.finish();
}

criterion_group!(benches, random_forest_bench);
criterion_main!(benches);

3 changes: 3 additions & 0 deletions linfa-ensemble/src/lib.rs
@@ -0,0 +1,3 @@
mod random_forest;

pub use random_forest::*;
122 changes: 122 additions & 0 deletions linfa-ensemble/src/random_forest/algorithm.rs
@@ -0,0 +1,122 @@
use linfa_trees::DecisionTree;
use crate::random_forest::hyperparameters::RandomForestParams;
use ndarray::{Array1, Array2, ArrayBase, Data, Ix1, Ix2};
use ndarray_rand::rand_distr::Uniform;
use ndarray::Axis;
use ndarray::Array;
use ndarray_rand::RandomExt;
use std::collections::HashMap;


pub struct RandomForest {
pub hyperparameters: RandomForestParams,
pub trees: Vec<DecisionTree>,
}

impl RandomForest {
pub fn fit(
hyperparameters: RandomForestParams,
x: &ArrayBase<impl Data<Elem = f64>, Ix2>,
y: &ArrayBase<impl Data<Elem = u64>, Ix1>,
) -> Self {

let n_estimators = hyperparameters.n_estimators;
let mut trees: Vec<DecisionTree> = Vec::with_capacity(n_estimators);
let single_tree_params = hyperparameters.tree_hyperparameters;

//TODO check bootstrap
let _bootstrap = hyperparameters.bootstrap;

for _ in 0..n_estimators {
// Bagging here
let rnd_idx = Array::random((1, x.nrows()), Uniform::new(0, x.nrows())).into_raw_vec();
let xsample = x.select(Axis(0), &rnd_idx);
let ysample = y.select(Axis(0), &rnd_idx);

let tree = DecisionTree::fit(single_tree_params, &xsample, &ysample);
trees.push(tree);
}

Self {
hyperparameters,
trees,
}
}

pub fn predict(&self, x: &ArrayBase<impl Data<Elem = f64>, Ix2>) -> Array1<u64> {
let ntrees = self.hyperparameters.n_estimators;
assert!(ntrees > 0, "Run .fit() method first");

let mut predictions: Array2<u64> = Array2::zeros((ntrees, x.nrows()));

for i in 0..ntrees {
let single_pred = self.trees[i].predict(&x);
dbg!("single pred: ", &single_pred);

// TODO can we make this more idiomatic rust
for j in 0..single_pred.len() {
predictions[[i, j]] = single_pred[j];
}
}

let mut result: Vec<u64> = Vec::with_capacity(x.nrows());
for j in 0..predictions.ncols() {
let mut counter_stats: HashMap<u64, u64> = HashMap::new();
for i in 0..ntrees {
*counter_stats.entry(predictions[[i,j]]).or_insert(0) += 1;
}

let final_pred = counter_stats
.iter()
.max_by(|a,b| a.1.cmp(&b.1))
.map(|(k, _v)| k)
.unwrap();
result.push(*final_pred);
}
Array1::from(result)
}
}
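The bagging step in `fit` draws row indices uniformly with replacement before fitting each tree. A minimal, dependency-free sketch of that sampling (a tiny LCG stands in for `ndarray_rand`'s RNG; the function name `bootstrap_indices` is hypothetical, not part of the PR):

```rust
/// Draw `n_rows` indices in `0..n_rows` with replacement, as the bagging
/// loop in `fit` does via `Array::random`. A linear congruential generator
/// (constants from Numerical Recipes) keeps the example self-contained.
fn bootstrap_indices(n_rows: usize, seed: u64) -> Vec<usize> {
    let mut state = seed;
    (0..n_rows)
        .map(|_| {
            // Advance the LCG and take the high bits for better uniformity.
            state = state
                .wrapping_mul(6364136223846793005)
                .wrapping_add(1442695040888963407);
            ((state >> 33) as usize) % n_rows
        })
        .collect()
}

fn main() {
    let idx = bootstrap_indices(10, 42);
    assert_eq!(idx.len(), 10);
    assert!(idx.iter().all(|&i| i < 10)); // every index stays in bounds
    println!("{:?}", idx);
}
```

Because sampling is with replacement, some rows appear several times and others not at all, which is what gives each tree a slightly different view of the data.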


#[cfg(test)]
mod tests {
use super::*;
use linfa_trees::DecisionTreeParams;
use crate::random_forest::hyperparameters::{RandomForestParamsBuilder,
MaxFeatures};

#[test]
fn test_random_forest_fit() {
// Load data
let data = vec![0.54439407, 0.26408166, 0.97446289, 0.81338034, 0.08248497,
0.30045893, 0.35535142, 0.26975284, 0.46910295, 0.72357513,
0.77458868, 0.09104661, 0.17291617, 0.50215056, 0.26381918,
0.06778572, 0.92139866, 0.30618514, 0.36123106, 0.90650849,
0.88988489, 0.44992222, 0.95507872, 0.52735043, 0.42282919,
0.98382015, 0.68076762, 0.4890352 , 0.88607302, 0.24732972,
0.98936691, 0.73508201, 0.16745694, 0.25099697, 0.32681078,
0.37070237, 0.87316842, 0.85858922, 0.55702507, 0.06624119,
0.3272859 , 0.46670468, 0.87466706, 0.51465624, 0.69996642,
0.04334688, 0.6785262 , 0.80599445, 0.6690343 , 0.29780375];

let xtrain = Array::from(data).into_shape((10, 5)).unwrap();
let ytrain = Array1::from(vec![0, 1, 0, 1, 1, 0, 1, 0, 1, 1]);

// Define parameters of single tree
let tree_params = DecisionTreeParams::new(2)
.max_depth(Some(3))
.min_samples_leaf(2 as u64)
.build();
// Define parameters of random forest
let ntrees = 100;
let rf_params = RandomForestParamsBuilder::new(tree_params, ntrees)
.max_features(Some(MaxFeatures::Auto))
.build();
let rf = RandomForest::fit(rf_params, &xtrain, &ytrain);
assert_eq!(rf.trees.len(), ntrees);

let preds = rf.predict(&xtrain);
dbg!("Predictions: {}", preds);
}

}
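The TODO in `predict` asks for a more idiomatic majority vote over the per-tree predictions. One possible shape, using iterator adapters over plain `Vec`s so it stands alone (the name `majority_vote` is an illustration, not the PR's API):

```rust
use std::collections::HashMap;

/// Majority vote over per-tree predictions: `predictions[i][j]` is tree i's
/// class for sample j, mirroring the `Array2<u64>` layout in `predict`.
fn majority_vote(predictions: &[Vec<u64>]) -> Vec<u64> {
    let n_samples = predictions.first().map_or(0, |row| row.len());
    (0..n_samples)
        .map(|j| {
            // Count votes for sample j across all trees.
            let mut counts: HashMap<u64, usize> = HashMap::new();
            for row in predictions {
                *counts.entry(row[j]).or_insert(0) += 1;
            }
            // Take the class with the most votes (ties resolved arbitrarily,
            // as with the HashMap-based loop in `predict`).
            counts
                .into_iter()
                .max_by_key(|&(_, count)| count)
                .map(|(class, _)| class)
                .expect("at least one tree prediction per sample")
        })
        .collect()
}

fn main() {
    let preds = vec![vec![0, 1, 1], vec![0, 1, 0], vec![1, 1, 0]];
    println!("{:?}", majority_vote(&preds)); // [0, 1, 0]
}
```

This replaces the two index-driven loops with a single pass per sample and avoids pre-allocating the intermediate `predictions` copy loop.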
95 changes: 95 additions & 0 deletions linfa-ensemble/src/random_forest/hyperparameters.rs
@@ -0,0 +1,95 @@
use linfa_trees::DecisionTreeParams;

#[derive(Clone, Copy)]
pub struct RandomForestParams {
pub n_estimators: usize,
pub tree_hyperparameters: DecisionTreeParams,
pub max_features: u64,
pub bootstrap: bool
}

#[derive(Clone, Copy)]
pub enum MaxFeatures {
Auto,
Sqrt,
Log2
}

pub struct RandomForestParamsBuilder {
tree_hyperparameters: DecisionTreeParams,
n_estimators: usize,
max_features: Option<MaxFeatures>,
bootstrap: Option<bool>
}

impl RandomForestParamsBuilder {
pub fn new(
tree_hyperparameters: DecisionTreeParams,
n_estimators: usize) -> RandomForestParamsBuilder {
RandomForestParamsBuilder {
tree_hyperparameters,
n_estimators,
max_features: Some(MaxFeatures::Auto),
bootstrap: Some(false)
}
}

pub fn set_tree_hyperparameters(
mut self,
tree_hyperparameters: DecisionTreeParams) -> Self {
self.tree_hyperparameters = tree_hyperparameters;
self
}

pub fn n_estimators(mut self, n_estimators: usize) -> Self {
self.n_estimators = n_estimators;
self
}

pub fn max_features(mut self, n_feats: Option<MaxFeatures>) -> Self {
self.max_features = n_feats;
self
}

pub fn bootstrap(mut self, bootstrap: Option<bool>) -> Self {
self.bootstrap = bootstrap;
self
}

pub fn build(&self) -> RandomForestParams {
// TODO
let max_features = self.max_features.unwrap_or(MaxFeatures::Auto);

let max_features = match max_features {
MaxFeatures::Auto => 42,
MaxFeatures::Log2 => 42,
MaxFeatures::Sqrt => 42
};
let bootstrap = self.bootstrap.unwrap_or(false);

RandomForestParams {
tree_hyperparameters: self.tree_hyperparameters,
n_estimators: self.n_estimators,
max_features,
bootstrap,
}

}
}
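The `build` method above still maps every `MaxFeatures` variant to the placeholder `42` (marked TODO). A hedged sketch of how the resolution could work, assuming the scikit-learn convention where `Auto` means `sqrt(n_features)` for classification — this is an assumed behaviour, not what the PR implements, and `resolve_max_features` is a hypothetical helper:

```rust
#[derive(Clone, Copy)]
pub enum MaxFeatures {
    Auto,
    Sqrt,
    Log2,
}

/// Resolve a `MaxFeatures` choice into a concrete feature count for a
/// dataset with `n_features` columns (assumed semantics, see lead-in).
fn resolve_max_features(choice: MaxFeatures, n_features: usize) -> usize {
    let n = n_features as f64;
    let resolved = match choice {
        // Classification "auto" is conventionally sqrt(n_features).
        MaxFeatures::Auto | MaxFeatures::Sqrt => n.sqrt().floor() as usize,
        MaxFeatures::Log2 => n.log2().floor() as usize,
    };
    resolved.max(1) // always consider at least one feature per split
}

fn main() {
    println!("{}", resolve_max_features(MaxFeatures::Sqrt, 25)); // 5
    println!("{}", resolve_max_features(MaxFeatures::Log2, 64)); // 6
}
```

Plumbing `n_features` through would require `build` to take the feature count (or defer resolution to `fit`, where `x.ncols()` is known), which is likely why the value is still stubbed out here.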
5 changes: 5 additions & 0 deletions linfa-ensemble/src/random_forest/mod.rs
@@ -0,0 +1,5 @@
mod algorithm;
mod hyperparameters;

pub use algorithm::*;
pub use hyperparameters::*;
2 changes: 2 additions & 0 deletions linfa-trees/src/decision_trees/algorithm.rs
@@ -36,6 +36,7 @@ impl SortedIndex {
}
}

#[derive(Debug)]
struct TreeNode {
feature_idx: usize,
split_value: f64,
@@ -208,6 +209,7 @@ impl TreeNode {
}

/// A fitted decision tree model.
#[derive(Debug)]
pub struct DecisionTree {
hyperparameters: DecisionTreeParams,
root_node: TreeNode,
6 changes: 3 additions & 3 deletions linfa-trees/src/decision_trees/hyperparameters.rs
@@ -1,13 +1,13 @@
/// The possible impurity measures for training.
#[derive(Clone, Copy)]
#[derive(Clone, Copy, Debug)]
pub enum SplitQuality {
Gini,
Entropy,
}

/// The set of hyperparameters that can be specified for fitting a
/// [decision tree](struct.DecisionTree.html).
#[derive(Clone, Copy)]
#[derive(Clone, Copy, Debug)]
pub struct DecisionTreeParams {
pub n_classes: u64,
pub split_quality: SplitQuality,
@@ -90,7 +90,7 @@ impl DecisionTreeParams {
}
}

fn build(
pub fn build(
n_classes: u64,
split_quality: SplitQuality,
max_depth: Option<u64>,