Linear decision trees improvements #60

Merged: 63 commits, Dec 6, 2020

Changes from 53 commits

Commits (63):
05e1f02  add debug trait to structs (fgadaleta, Sep 1, 2020)
7c3b1c0  initial commit for random forest implementation (fgadaleta, Sep 1, 2020)
e5c77ca  WIP first iter of random forest (fgadaleta, Sep 2, 2020)
7d2d512  add benches (fgadaleta, Sep 3, 2020)
da77696  setup bench and cleanup (Sep 3, 2020)
701182d  cleanup (Sep 3, 2020)
75bebb4  add max_n_rows for single decision tree fitting (Sep 10, 2020)
d045e5e  implement random forest feature importance as collection of features … (fgadaleta, Sep 12, 2020)
1b754e9  implement random forest feature importance as collection of features … (fgadaleta, Sep 12, 2020)
7fbfa01  remove unused var (fgadaleta, Sep 12, 2020)
da30535  remove unused var (fgadaleta, Sep 12, 2020)
b808415  run clippy (fgadaleta, Sep 15, 2020)
f75d334  assert test success for feature importance (fgadaleta, Sep 15, 2020)
5461488  clippy and fmt (fgadaleta, Sep 15, 2020)
d84a8b5  store references of nodes to queue (fgadaleta, Sep 15, 2020)
4cfd3d4  WIP voting classifier and predictor trait (fgadaleta, Sep 16, 2020)
2846ef1  WIP voting classifier and predictor trait (fgadaleta, Sep 16, 2020)
b571894  implement and test VotingClassifier hard voting (fgadaleta, Sep 16, 2020)
37c2bec  implement predict_proba for random forest and tested (fgadaleta, Sep 17, 2020)
46d8491  documentation, examples, cleanup (fgadaleta, Sep 17, 2020)
0efe3ab  cleanup (fgadaleta, Sep 17, 2020)
34f3c40  implement LinfaError for Predictor trait (fgadaleta, Sep 22, 2020)
79abd05  fixed tests and CI/CD pipeline (fgadaleta, Sep 22, 2020)
139ba82  renamed predict_classes to predict in logreg for consistency (fgadaleta, Sep 22, 2020)
7b4103e  implement ProbabilisticPredictor whenever needed (fgadaleta, Sep 30, 2020)
dbf9184  votingclassifier implements predictor trait (fgadaleta, Sep 30, 2020)
92f20ab  PR-43 Moss comments addressed (fgadaleta, Oct 9, 2020)
90a5a2a  Switch `linfa-tree` to new infrastructure (bytesnake, Nov 20, 2020)
cb069c9  Experiment with interface (bytesnake, Nov 26, 2020)
0fcac7b  Add argmax ensemble classifier (bytesnake, Nov 29, 2020)
f7fe3f7  Run fmt (bytesnake, Nov 29, 2020)
74944e9  Add test with random noise (bytesnake, Nov 30, 2020)
0347f3a  Customize decision trees with weights (bytesnake, Nov 30, 2020)
bf821df  Remove unnecessary casting (bytesnake, Dec 2, 2020)
057b34e  Compare weight in splits with hyperparams (mossbanay, Dec 3, 2020)
3a35e72  Rename _samples hyperparams (mossbanay, Dec 3, 2020)
ced8e89  Fix cargo fmt lint? (mossbanay, Dec 4, 2020)
914ef45  Shush random forest example for time being (mossbanay, Dec 4, 2020)
0ec1eb8  Added new test for perfectly separable data (mossbanay, Dec 4, 2020)
048aed4  Appease clippy (mossbanay, Dec 4, 2020)
df9e7cb  Merge pull request #1 from mossbanay/trees (bytesnake, Dec 4, 2020)
11a2e72  Merge branch 'trees' of github.com:bytesnake/linfa into trees (bytesnake, Dec 4, 2020)
44bc93d  Fix error in test (bytesnake, Dec 4, 2020)
fe10d24  Run cargo fmt (bytesnake, Dec 4, 2020)
8787599  Add max_depth function for decision trees (bytesnake, Dec 4, 2020)
3d160a0  Add impurity decrease function (bytesnake, Dec 4, 2020)
c8e6fcc  Add mean impurity decrease (bytesnake, Dec 4, 2020)
5864e76  Add more tests to linear decision trees (bytesnake, Dec 4, 2020)
6d0cc1d  Remove number of classes hyper-parameter (bytesnake, Dec 4, 2020)
6f5f78f  Remove ensemble algorithm (bytesnake, Dec 5, 2020)
b2b41a0  Address issue with toy test (bytesnake, Dec 5, 2020)
82b62c4  Merge branch 'master' of github.com:rust-ml/linfa into trees (bytesnake, Dec 6, 2020)
70402ab  Simplify tree inspection methods (bytesnake, Dec 6, 2020)
1f1849f  Add max depth testing and hyperparameter validation (bytesnake, Dec 6, 2020)
09348a0  Fix parameter syntax in benchmarks (bytesnake, Dec 6, 2020)
7df68bc  Run cargo fmt (bytesnake, Dec 6, 2020)
f0bdc1c  Add tikz export builder (bytesnake, Dec 6, 2020)
311fc94  Improve decision tree formatting (bytesnake, Dec 6, 2020)
fa410f7  Add pruning (bytesnake, Dec 6, 2020)
635f98c  Adjust syntax of tikz snippet (bytesnake, Dec 6, 2020)
c658bd9  Merge branch 'master' of github.com:rust-ml/linfa into trees (bytesnake, Dec 6, 2020)
b268de2  Run cargo fmt (bytesnake, Dec 6, 2020)
5fd321a  Run cargo fmt (bytesnake, Dec 6, 2020)
Cargo.toml (3 changes: 1 addition & 2 deletions)
@@ -20,10 +20,10 @@ exclude = [".github/"]

[dependencies]
num-traits = "0.2"
rand = "0.7"
ndarray = { version = "0.13", default-features = false }

[dev-dependencies]
rand = "0.7"
ndarray-rand = "0.12"
approx = "0.3"

@@ -37,7 +37,6 @@ members = [
"linfa-trees",
"linfa-svm",
"linfa-hierarchical",
"linfa-ica",
]

[profile.release]
linfa-logistic/src/lib.rs (12 changes: 6 additions & 6 deletions)
@@ -437,7 +437,7 @@ impl<F: Float, C: PartialOrd + Clone> FittedLogisticRegression<F, C> {

/// Given a feature matrix, predict the classes learned when the model was
/// fitted.
pub fn predict_classes<A: Data<Elem = F>>(&self, x: &ArrayBase<A, Ix2>) -> Vec<C> {
pub fn predict<A: Data<Elem = F>>(&self, x: &ArrayBase<A, Ix2>) -> Vec<C> {
let pos_class = class_from_label(&self.labels, F::POSITIVE_LABEL);
let neg_class = class_from_label(&self.labels, F::NEGATIVE_LABEL);
self.predict_probabilities(x)
@@ -647,7 +647,7 @@ mod test {
let res = log_reg.fit(&x, &y).unwrap();
assert_eq!(res.intercept(), 0.0);
assert!(res.params().abs_diff_eq(&array![0.681], 1e-3));
assert_eq!(res.predict_classes(&x), y.to_vec());
assert_eq!(res.predict(&x), y.to_vec());
}

#[test]
@@ -661,7 +661,7 @@
assert!(res
.predict_probabilities(&x)
.abs_diff_eq(&array![0.501, 0.664, 0.335, 0.498], 1e-3));
assert_eq!(res.predict_classes(&x), y);
assert_eq!(res.predict(&x), y);
}

#[test]
@@ -683,7 +683,7 @@
let res = log_reg.fit(&x, &y).unwrap();
assert!(res.intercept().abs_diff_eq(&-4.124, 1e-3));
assert!(res.params().abs_diff_eq(&array![1.181], 1e-3));
assert_eq!(res.predict_classes(&x), y.to_vec());
assert_eq!(res.predict(&x), y.to_vec());
}

#[test]
@@ -776,7 +776,7 @@ mod test {
let res = log_reg.fit(&x, &y).unwrap();
assert!(res.intercept().abs_diff_eq(&-4.124, 1e-3));
assert!(res.params().abs_diff_eq(&array![1.181], 1e-3));
assert_eq!(res.predict_classes(&x), y.to_vec());
assert_eq!(res.predict(&x), y.to_vec());
}

#[test]
@@ -787,6 +787,6 @@
let res = log_reg.fit(&x, &y).unwrap();
assert_eq!(res.intercept(), 0.0 as f32);
assert!(res.params().abs_diff_eq(&array![0.682 as f32], 1e-3));
assert_eq!(res.predict_classes(&x), y.to_vec());
assert_eq!(res.predict(&x), y.to_vec());
}
}
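
For downstream users of linfa-logistic, the rename in commit 139ba82 is a one-word change at the call site. Below is a minimal sketch of the new call, assuming the `LogisticRegression::default()` constructor and the pre-Dataset `fit(&x, &y)` signature that the tests above exercise; the toy data is illustrative, not taken from the test fixtures.

use linfa_logistic::LogisticRegression;
use ndarray::array;

fn main() {
    // Illustrative two-class data: one feature, numeric class labels.
    let x = array![[-1.0], [-0.01], [0.01], [1.0]];
    let y = array![0.0, 0.0, 1.0, 1.0];

    let model = LogisticRegression::default().fit(&x, &y).unwrap();

    // Before this PR: model.predict_classes(&x)
    // After this PR:  model.predict(&x)
    let classes = model.predict(&x);
    assert_eq!(classes, y.to_vec());
}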
linfa-trees/Cargo.toml (5 changes: 2 additions & 3 deletions)
@@ -16,13 +16,12 @@ categories = ["algorithms", "mathematics", "science"]
ndarray = { version = "0.13" , features = ["rayon", "approx"]}
ndarray-rand = "0.11"

linfa = { path = ".." }

[dev-dependencies]
rand_isaac = "0.2.0"
ndarray-npy = { version = "0.5", default-features = false }
criterion = "0.3"
serde_json = "1"
approx = "0.3"
#linfa-clustering = { version = "0.2.1", path = "../linfa-clustering" }

[[bench]]
name = "decision_tree"
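
The added `linfa = { path = ".." }` dependency ties linfa-trees to the shared traits in the root crate. For an application pulling the published crates instead of the workspace paths, the manifest entries would look roughly like the snippet below; the version numbers are assumptions for the release line current at the time of this PR, not taken from this diff.

# Hypothetical downstream manifest; crate versions are illustrative only.
[dependencies]
linfa = "0.2"
linfa-trees = "0.2"
ndarray = { version = "0.13", default-features = false }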
linfa-trees/benches/decision_tree.rs (42 changes: 24 additions & 18 deletions)
@@ -1,48 +1,54 @@
use criterion::{criterion_group, criterion_main, BenchmarkId, Criterion};
use linfa_clustering::generate_blobs;
use linfa_trees::{DecisionTree, DecisionTreeParams};
use ndarray::{Array, Array2};
use linfa::prelude::*;
use linfa_trees::DecisionTree;
use ndarray::{stack, Array, Array2, Axis};
use ndarray_rand::rand::SeedableRng;
use ndarray_rand::rand_distr::Uniform;
use ndarray_rand::rand_distr::{StandardNormal, Uniform};
use ndarray_rand::RandomExt;
use rand_isaac::Isaac64Rng;
use std::iter::FromIterator;

fn generate_blobs(means: &Array2<f64>, samples: usize, mut rng: &mut Isaac64Rng) -> Array2<f64> {
let out = means
.axis_iter(Axis(0))
.map(|mean| Array::random_using((samples, 4), StandardNormal, &mut rng) + mean)
.collect::<Vec<_>>();
let out2 = out.iter().map(|x| x.view()).collect::<Vec<_>>();

stack(Axis(0), &out2).unwrap()
}

fn decision_tree_bench(c: &mut Criterion) {
let mut rng = Isaac64Rng::seed_from_u64(42);

// Controls how many samples for each class are generated
let training_set_sizes = vec![100, 1000, 10000, 100000];

let n_classes: u64 = 4;
let n_classes = 4;
let n_features = 4;

// Use the default configuration
let hyperparams = DecisionTreeParams::new(n_classes as u64);
let hyperparams = DecisionTree::params(n_classes);

// Benchmark training time 10 times for each training sample size
let mut group = c.benchmark_group("decision_tree");
group.sample_size(10);

for n in training_set_sizes.iter() {
let centroids = Array2::random_using(
(n_classes as usize, n_features),
Uniform::new(-30., 30.),
&mut rng,
);
let centroids =
Array2::random_using((n_classes, n_features), Uniform::new(-30., 30.), &mut rng);

let train_x = generate_blobs(*n, &centroids, &mut rng);
let train_x = generate_blobs(&centroids, *n, &mut rng);
let train_y = Array::from_iter(
(0..n_classes)
.map(|x| std::iter::repeat(x).take(*n).collect::<Vec<u64>>())
.map(|x| std::iter::repeat(x).take(*n).collect::<Vec<usize>>())
.flatten(),
);
let dataset = Dataset::new(train_x, train_y);

group.bench_with_input(
BenchmarkId::from_parameter(n),
&(train_x, train_y),
|b, (x, y)| b.iter(|| DecisionTree::fit(hyperparams.build(), &x, &y)),
);
group.bench_with_input(BenchmarkId::from_parameter(n), &dataset, |b, d| {
b.iter(|| hyperparams.fit(&d))
});
}

group.finish();
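
The reworked benchmark routes everything through the new `Dataset` type: records and targets are bundled once and the hyperparameter set is fitted directly against them. The sketch below restates that flow outside the Criterion harness, using the calls exercised in this PR's hunks. Note that at this 53-commit snapshot the benchmark still passes `n_classes` to `DecisionTree::params`, even though an earlier commit removed that hyperparameter; a later commit in the PR ("Fix parameter syntax in benchmarks") aligns it, so the sketch uses the argument-free form seen in the example file below.

use linfa::prelude::*;
use linfa_trees::DecisionTree;
use ndarray::{stack, Array, Array1, Array2, Axis};
use ndarray_rand::rand::SeedableRng;
use ndarray_rand::rand_distr::{StandardNormal, Uniform};
use ndarray_rand::RandomExt;
use rand_isaac::Isaac64Rng;

fn main() {
    let mut rng = Isaac64Rng::seed_from_u64(42);
    let (n_classes, n_features, samples) = (4, 4, 100);

    // Random centroids, then a unit-variance Gaussian blob around each one,
    // mirroring the benchmark's local `generate_blobs` helper.
    let centroids =
        Array2::random_using((n_classes, n_features), Uniform::new(-30., 30.), &mut rng);
    let blobs = centroids
        .axis_iter(Axis(0))
        .map(|mean| Array::random_using((samples, n_features), StandardNormal, &mut rng) + mean)
        .collect::<Vec<_>>();
    let views = blobs.iter().map(|b| b.view()).collect::<Vec<_>>();
    let records = stack(Axis(0), &views).unwrap();

    // One class label per blob, repeated for each of its samples.
    let targets = (0..n_classes)
        .flat_map(|class| std::iter::repeat(class).take(samples))
        .collect::<Array1<usize>>();

    // Bundle records and targets, then fit a decision tree with default parameters.
    let dataset = Dataset::new(records, targets);
    let tree = DecisionTree::params().fit(&dataset);

    // One predicted label per row of the records matrix.
    let _predictions = tree.predict(dataset.records().view());
}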
linfa-trees/examples/decision_tree.rs (129 changes: 43 additions & 86 deletions)
@@ -1,122 +1,79 @@
use linfa_trees::{DecisionTree, DecisionTreeParams, SplitQuality};
use ndarray::{array, s, Array, Array2, ArrayBase, Data, Ix1, Ix2};
use ndarray_rand::rand::Rng;
use ndarray::{array, stack, Array, Array1, Array2, Axis};
use ndarray_rand::rand::SeedableRng;
use ndarray_rand::rand_distr::StandardNormal;
use ndarray_rand::RandomExt;
use rand_isaac::Isaac64Rng;
use std::iter::FromIterator;

/// Given an input matrix `blob_centroids`, with shape `(n_blobs, n_features)`,
/// generate `blob_size` data points (a "blob") around each of the blob centroids.
///
/// More specifically, each blob is formed by `blob_size` points sampled from a normal
/// distribution centered in the blob centroid with unit variance.
///
/// `generate_blobs` can be used to quickly assemble a synthetic dataset to test or
/// benchmark various clustering algorithms on a best-case scenario input.
pub fn generate_blobs(
blob_size: usize,
blob_centroids: &ArrayBase<impl Data<Elem = f64>, Ix2>,
rng: &mut impl Rng,
) -> Array2<f64> {
let (n_centroids, n_features) = blob_centroids.dim();
let mut blobs: Array2<f64> = Array2::zeros((n_centroids * blob_size, n_features));

for (blob_index, blob_centroid) in blob_centroids.genrows().into_iter().enumerate() {
let blob = generate_blob(blob_size, &blob_centroid, rng);

let indexes = s![blob_index * blob_size..(blob_index + 1) * blob_size, ..];
blobs.slice_mut(indexes).assign(&blob);
}
blobs
}

/// Generate `blob_size` data points (a "blob") around `blob_centroid`.
///
/// More specifically, the blob is formed by `blob_size` points sampled from a normal
/// distribution centered in `blob_centroid` with unit variance.
///
/// `generate_blob` can be used to quickly assemble a synthetic stereotypical cluster.
pub fn generate_blob(
blob_size: usize,
blob_centroid: &ArrayBase<impl Data<Elem = f64>, Ix1>,
rng: &mut impl Rng,
) -> Array2<f64> {
let shape = (blob_size, blob_centroid.len());
let origin_blob: Array2<f64> = Array::random_using(shape, StandardNormal, rng);
origin_blob + blob_centroid
}
use linfa::prelude::*;
use linfa_trees::{DecisionTree, SplitQuality};

fn generate_blobs(means: &[(f64, f64)], samples: usize, mut rng: &mut Isaac64Rng) -> Array2<f64> {
let out = means
.into_iter()
.map(|mean| {
Array::random_using((samples, 2), StandardNormal, &mut rng) + array![mean.0, mean.1]
})
.collect::<Vec<_>>();
let out2 = out.iter().map(|x| x.view()).collect::<Vec<_>>();

fn accuracy(
labels: &ArrayBase<impl Data<Elem = u64>, Ix1>,
pred: &ArrayBase<impl Data<Elem = u64>, Ix1>,
) -> f64 {
let true_positive: f64 = labels
.iter()
.zip(pred.iter())
.filter(|(x, y)| x == y)
.map(|_| 1.0)
.sum();
true_positive / labels.len() as f64
stack(Axis(0), &out2).unwrap()
}

fn main() {
// Our random number generator, seeded for reproducibility
let mut rng = Isaac64Rng::seed_from_u64(42);

// For each our expected centroids, generate `n` data points around it (a "blob")
let n_classes: u64 = 4;
let expected_centroids = array![[0., 0.], [1., 4.], [-5., 0.], [4., 4.]];
let n = 100;
let n_classes: usize = 4;
let n = 300;

println!("Generating training data");

let train_x = generate_blobs(n, &expected_centroids, &mut rng);
let train_y = Array::from_iter(
(0..n_classes)
.map(|x| std::iter::repeat(x).take(n).collect::<Vec<u64>>())
.flatten(),
);
let train_x = generate_blobs(&[(0., 0.), (1., 4.), (-5., 0.), (4., 4.)], n, &mut rng);
let train_y = (0..n_classes)
.map(|x| std::iter::repeat(x).take(n).collect::<Vec<_>>())
.flatten()
.collect::<Array1<_>>();

let test_x = generate_blobs(n, &expected_centroids, &mut rng);
let test_y = Array::from_iter(
(0..n_classes)
.map(|x| std::iter::repeat(x).take(n).collect::<Vec<u64>>())
.flatten(),
);

println!("Generated training data");
let dataset = Dataset::new(train_x, train_y).shuffle(&mut rng);
let (train, test) = dataset.split_with_ratio(0.9);

println!("Training model with Gini criterion ...");
let gini_hyperparams = DecisionTreeParams::new(n_classes)
let gini_model = DecisionTree::params()
.split_quality(SplitQuality::Gini)
.max_depth(Some(100))
.min_samples_split(10)
.min_samples_leaf(10)
.build();
.min_weight_split(10.0)
.min_weight_leaf(10.0)
.fit(&train);

let gini_pred_y = gini_model.predict(test.records().view());
let cm = gini_pred_y.confusion_matrix(&test);

let gini_model = DecisionTree::fit(gini_hyperparams, &train_x, &train_y);
println!("{:?}", cm);

let gini_pred_y = gini_model.predict(&test_x);
println!(
"Test accuracy with Gini criterion: {:.2}%",
100.0 * accuracy(&test_y, &gini_pred_y)
100.0 * cm.accuracy()
);

println!("Training model with entropy criterion ...");
let entropy_hyperparams = DecisionTreeParams::new(n_classes)
let entropy_model = DecisionTree::params()
.split_quality(SplitQuality::Entropy)
.max_depth(Some(100))
.min_samples_split(10)
.min_samples_leaf(10)
.build();
.min_weight_split(10.0)
.min_weight_leaf(10.0)
.fit(&train);

let entropy_model = DecisionTree::fit(entropy_hyperparams, &train_x, &train_y);
let entropy_pred_y = gini_model.predict(test.records().view());
let cm = entropy_pred_y.confusion_matrix(&test);

println!("{:?}", cm);

let entropy_pred_y = entropy_model.predict(&test_x);
println!(
"Test accuracy with Entropy criterion: {:.2}%",
100.0 * accuracy(&test_y, &entropy_pred_y)
100.0 * cm.accuracy()
);

let feats = entropy_model.features();
println!("Features trained in this tree {:?}", feats);
}
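
The example closes by printing the split features via `features()`. Later commits in this PR add more inspection helpers ("Add max_depth function for decision trees", "Add mean impurity decrease") plus a tikz export builder, none of which are visible in this 53-commit snapshot. The sketch below shows how they would plausibly be used; the method names `max_depth` and `mean_impurity_decrease` are assumptions taken from the commit messages, not from code shown here.

use linfa::prelude::*;
use linfa_trees::DecisionTree;
use ndarray::array;

fn main() {
    // Tiny toy dataset: the first feature separates the two classes, the second is constant.
    let records = array![[0.0, 1.0], [1.0, 1.0], [10.0, 1.0], [11.0, 1.0]];
    let targets = array![0usize, 0, 1, 1];
    let dataset = Dataset::new(records, targets);

    let model = DecisionTree::params().fit(&dataset);

    // Confirmed by the example above: indices of the features the tree split on.
    println!("features used: {:?}", model.features());

    // Assumed helper names, taken from the later commit messages in this PR:
    println!("depth: {:?}", model.max_depth());
    println!("mean impurity decrease per feature: {:?}", model.mean_impurity_decrease());

    // A tikz export builder is also added later in the PR (commit f0bdc1c),
    // but its entry point is not visible in the hunks shown above.
}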