
Introduction to SharpLearning


Introduction

This guide is an introduction to using SharpLearning for learning, evaluating and improving a machine learning model. The examples are made to illustrate how to use SharpLearning, but the overall concepts should also be applicable to other machine learning libraries.

This guide will use the wine quality data set, which is also included in SharpLearning.Examples. The dataset can be used for both classification and regression. In this case, we will use the data to create a regression model for scoring the quality of white wine. The full code examples from this guide can be found in SharpLearning.Examples.

The guide will cover the following topics:

  • Importing/reading data from csv.
  • Splitting data into training/test for evaluating models.
  • Learning a simple model.
  • Improving the simple model by tuning hyperparameters.
  • Improving the model by using more advanced machine learning algorithms.
  • Using variable importance to gain insights about the model and data.

Notation

  • Learner - Machine learning algorithm.
  • Model - Machine learning model.
  • Hyperparameters - The parameters used to regulate the complexity of a machine learning model.
  • Overfitting - The model is too complex; also known as high variance.
  • Underfitting - The model is too simple; also known as high bias.
  • Target(s) - The value(s) we are trying to model, also known as the dependent variable. In some libraries this is called (y).
  • Observation(s) - Feature matrix, also known as the independent variables, contains all the information we have to describe the targets. In some libraries this is called (x).

Importing/reading data from csv.

In SharpLearning, csv data can be read using the CsvParser located in the namespace SharpLearning.InputOutput.Csv. Below, the CsvParser is created using a StreamReader to read from the filesystem.

// Setup the CsvParser
var parser = new CsvParser(() => new StreamReader("winequality-white.csv"), separator: ';');

// the column name in the wine quality data set we want to model.
var targetName = "quality";

// read the "quality" column, this is the targets for our learner. 
var targets = parser.EnumerateRows(targetName)
    .ToF64Vector();

// read the feature matrix, all columns except "quality",
// this is the observations for our learner.
var observations = parser.EnumerateRows(c => c != targetName)
    .ToF64Matrix();

The methods ToF64Vector and ToF64Matrix convert from CsvRows to double format. ToF64Vector returns a double[] and ToF64Matrix returns an F64Matrix. There are corresponding methods for converting to string[] and StringMatrix in case further transforms have to be done before converting to double format.
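As an example, a minimal sketch of the string-based path, assuming the corresponding extension methods are named ToStringVector and ToStringMatrix to match the double versions (and that System.Linq and System.Globalization are imported):

// Sketch: read the target column as strings, apply a transform,
// then parse to doubles afterwards.
var rawTargets = parser.EnumerateRows(targetName)
    .ToStringVector();

// hypothetical transform: trim whitespace before parsing.
var transformedTargets = rawTargets
    .Select(value => double.Parse(value.Trim(), CultureInfo.InvariantCulture))
    .ToArray();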

Splitting data into training/test set.

In SharpLearning, splitting data into training/test is done using the TrainingTestIndexSplitters. There are various versions of these, corresponding to how the data should be distributed between the training and test set:

  • NoShuffleTrainingTestIndexSplitter - Keeps the data in the original order before splitting.
  • RandomTrainingTestIndexSplitter - Randomly shuffles the data before splitting. Usually used for regression.
  • StratifiedTrainingTestIndexSplitter - Ensures that the distribution of unique target values is similar between training and test set. Usually used for classification.

Since we want to learn a regression model from the wine quality data set, we will be using the RandomTrainingTestIndexSplitter. Here we specify that we are going to use 70% of the data for the training set, which leaves 30% of the data for the test set.

// 30 % of the data is used for the test set. 
var splitter = new RandomTrainingTestIndexSplitter<double>(trainingPercentage: 0.7, seed: 24);

var trainingTestSplit = splitter.SplitSet(observations, targets);
var trainSet = trainingTestSplit.TrainingSet;
var testSet = trainingTestSplit.TestSet;
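For a classification data set, the StratifiedTrainingTestIndexSplitter would be used in the same way; a minimal sketch, assuming it takes the same constructor arguments as the random splitter:

// Sketch: stratified split, keeping the distribution of target
// classes similar between training and test set.
var stratifiedSplitter = new StratifiedTrainingTestIndexSplitter<double>(trainingPercentage: 0.7, seed: 24);
var stratifiedSplit = stratifiedSplitter.SplitSet(observations, targets);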

Learning a simple model (DecisionTreeModel).

Now that we have read the data and have a training and test set available, we can start by applying a simple machine learning algorithm and measure how well it performs on the test set. The test set error is our estimate of how well the model generalizes to new data, so our goal is to get this error as low as possible.

Before we can evaluate the model, we have to decide how to measure its performance. In SharpLearning, several different metrics are available in the SharpLearning.Metrics project. A standard metric for evaluating a regression model is the mean squared error. Since we are creating a regression model, this is the metric we are going to use.

The first machine learning algorithm we are going to try is a RegressionDecisionTreeLearner with default parameters:

// Create the learner and learn the model.
var learner = new RegressionDecisionTreeLearner();
var model = learner.Learn(trainSet.Observations, trainSet.Targets);

// predict the training and test set.
var trainPredictions = model.Predict(trainSet.Observations);
var testPredictions = model.Predict(testSet.Observations);

// create the metric
var metric = new MeanSquaredErrorRegressionMetric();
            
// measure the error on training and test set.
var trainError = metric.Error(trainSet.Targets, trainPredictions);
var testError = metric.Error(testSet.Targets, testPredictions);

We measure the error on both the training and test set:

| Algorithm | Train Error | Test Error |
| --- | --- | --- |
| RegressionDecisionTreeLearner (default) | 0.0000 | 0.8415 |

As can be seen, the training error is zero while the test error is relatively high. This is a sign that our RegressionDecisionTreeLearner might be overfitting the data.

The RegressionDecisionTreeLearner has several hyperparameters for controlling how complex the model can be. One of them is the maximum tree depth. This parameter controls the maximum allowed depth of the decision tree model, and thereby how many interactions between the input features the algorithm is allowed to make.

Let's try to reduce the maximum tree depth to 10:

learner = new RegressionDecisionTreeLearner(maximumTreeDepth: 10);

We measure the error again with the new hyperparameters:

| Algorithm | Train Error | Test Error |
| --- | --- | --- |
| RegressionDecisionTreeLearner (manually tuned) | 0.2681 | 0.6914 |

This made the training error higher, but reduced the test error from 0.8415 to 0.6914. Since our goal is to create a model that generalizes well to new data, this is a step in the right direction.

We could continue to manually tweak the maximum tree depth and see if we could get an even lower test error, but a better alternative is to let an optimizer do the work for us.

Tuning the hyperparameters of a DecisionTreeLearner

Manually tuning the hyperparameters of a machine learning algorithm can be a very time-consuming task, especially when the number of hyperparameters grows large. Therefore, it is preferable to use an optimizer and let the computer do the work.

In SharpLearning, there are several different optimizers in the SharpLearning.Optimizer project. For this problem we are going to use the RandomSearchOptimizer.

To evaluate each set of hyperparameters during optimization, we further split the training data into a training/validation set and leave our current test set out of the optimization. If we optimized directly on the test set error, we would risk an optimistically biased final error estimate:

// Further split the training data to have a validation set to measure
// how well the model generalizes to unseen data during the optimization.
var validationSplit = new RandomTrainingTestIndexSplitter<double>(trainingPercentage: 0.7, seed: 24)
    .SplitSet(trainSet.Observations, trainSet.Targets);

The optimizer needs to know the bounds of the hyperparameters we are going to tune. In this case we are going to tune the maximum tree depth and the minimum split size for the RegressionDecisionTreeLearner:

// Parameter ranges for the optimizer 
var parameters = new double[][]
{
   new double[] { 1, 100 }, // maximumTreeDepth (min: 1, max: 100)
   new double[] { 1, 16 }, // minimumSplitSize (min: 1, max: 16)
};

The optimizer also needs an objective function, in which we learn a candidate RegressionDecisionTreeModel using the current set of hyperparameters and evaluate its performance on the validation set. Again, we are using the mean squared error as the metric. The objective function takes as input a double[] containing the candidate set of hyperparameters, and returns an OptimizerResult containing the validation error and the corresponding set of hyperparameters:

 // Define optimizer objective (function to minimize)
 Func<double[], OptimizerResult> minimize = p =>
 {
   // create the candidate learner using the current optimization parameters.
   var candidateLearner = new RegressionDecisionTreeLearner(maximumTreeDepth: (int)p[0], 
       minimumSplitSize: (int)p[1]);

   // learn the model on the validation training set.
   var candidateModel = candidateLearner.Learn(validationSplit.TrainingSet.Observations,
       validationSplit.TrainingSet.Targets);

   // measure the error on the validation test set.
   var validationPredictions = candidateModel.Predict(validationSplit.TestSet.Observations);
   var candidateError = metric.Error(validationSplit.TestSet.Targets, validationPredictions);

   return new OptimizerResult(p, candidateError);
};

When the objective function is defined, we can create and run the optimizer to find the best set of parameters. We are going to let the RandomSearchOptimizer run for 30 iterations, trying out 30 sets of hyperparameters sampled randomly within the bounds we defined earlier:

// create optimizer
var optimizer = new RandomSearchOptimizer(parameters, iterations: 30, runParallel: true);

// find best hyperparameters
var result = optimizer.OptimizeBest(minimize);
var best = result.ParameterSet;
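The returned OptimizerResult also carries the validation error of the best parameter set, which can be handy for logging; a small usage sketch, assuming the validation error is exposed as an Error property matching the constructor used in the objective function:

// Sketch: inspect the best result found by the optimizer.
Console.WriteLine("Best validation error: " + result.Error);
Console.WriteLine("Parameters: " + string.Join(", ", result.ParameterSet));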

The optimizer finds the best set of hyperparameters to be:

  • MaximumTreeDepth: 6
  • MinimumSplitSize: 7

After the optimizer has found the best set of hyperparameters, measured on the validation set, we can create a learner using these parameters and learn a new RegressionDecisionTreeModel on the full training set:

// create learner with found hyperparameters
var learner = new RegressionDecisionTreeLearner(maximumTreeDepth: (int)best[0], 
                minimumSplitSize: (int)best[1]);

// learn model with found hyperparameters
var model = learner.Learn(trainSet.Observations, trainSet.Targets);

The new set of hyperparameters further reduces the error on the test set, from 0.6914 to 0.5804:

| Algorithm | Train Error | Test Error |
| --- | --- | --- |
| RegressionDecisionTreeLearner (optimizer) | 0.4544 | 0.5804 |

Now that we have tuned the hyperparameters of the RegressionDecisionTreeLearner, we are starting to approach the limits of this machine learning algorithm.

Learning a RandomForestModel

Next, we are going to try a more sophisticated machine learning algorithm. The RandomForest learner is a good starting point since it usually performs very well with its default hyperparameters and seldom requires further tuning. This makes the algorithm easy to use over a wide variety of different problem domains without much expert knowledge.

Let's see if we can improve our wine quality model further by switching to a RegressionRandomForestLearner with default parameters:

// create the random forest learner with default parameters
var learner = new RegressionRandomForestLearner();
            
// learn model
var model = learner.Learn(trainSet.Observations, trainSet.Targets);

Using the RegressionRandomForestLearner, the test error is reduced further, from 0.5804 to 0.4030:

| Algorithm | Train Error | Test Error |
| --- | --- | --- |
| RegressionRandomForestLearner (default) | 0.0518 | 0.4030 |

Compared with the RegressionDecisionTreeLearner combined with the RandomSearchOptimizer, using a RegressionRandomForestLearner with default parameters is both simpler and provides better results. However, the resulting model is larger and more complex.

Usually, a RegressionRandomForestLearner performs well with its default hyperparameters. But to see if we can improve further, we will try to optimize the hyperparameters using the RandomSearchOptimizer, as we did with the DecisionTreeLearner.

Since we are now tuning the parameters of a RegressionRandomForestLearner, we need to define a new set of bounds for the optimizer. Here we are going to tune the number of trees, the number of features per split in the trees, and the maximum depth of the trees:

// number of feature columns in the observation matrix.
var numberOfFeatures = observations.ColumnCount;

var parameters = new double[][]
{
   new double[] { 100, 300 }, // trees (min: 100, max: 300)
   new double[] { 1, numberOfFeatures }, // featuresPrSplit (min: 1, max: numberOfFeatures)
   new double[] { 8, 100 }, // maximumTreeDepth (min: 8, max: 100)
};
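By analogy with the decision tree example, the objective function creates a candidate RegressionRandomForestLearner from the sampled parameters. A minimal sketch, assuming the constructor parameter names trees, featuresPrSplit and maximumTreeDepth:

// Sketch: objective function for tuning the random forest,
// mirroring the decision tree objective above.
Func<double[], OptimizerResult> minimize = p =>
{
    var candidateLearner = new RegressionRandomForestLearner(trees: (int)p[0],
        featuresPrSplit: (int)p[1],
        maximumTreeDepth: (int)p[2]);

    var candidateModel = candidateLearner.Learn(validationSplit.TrainingSet.Observations,
        validationSplit.TrainingSet.Targets);

    var validationPredictions = candidateModel.Predict(validationSplit.TestSet.Observations);
    var candidateError = metric.Error(validationSplit.TestSet.Targets, validationPredictions);

    return new OptimizerResult(p, candidateError);
};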

Running the optimizer finds the following hyperparameter set:

  • Trees: 242
  • featuresPrSplit: 2
  • maximumTreeDepth: 91

Measuring the error on the test set shows only a very small improvement over the default parameters, from 0.4030 to 0.3995:

| Algorithm | Train Error | Test Error |
| --- | --- | --- |
| RegressionRandomForestLearner (optimizer) | 0.0511 | 0.3995 |

Tuning the hyperparameters of a GradientBoostLearner

Another algorithm related to RandomForest is GradientBoost. Usually, tuning the hyperparameters of a GradientBoost model will reduce the error further than doing the same for a RandomForest model.

Since we are trying to minimize the mean squared error, we are going to use the RegressionSquareLossGradientBoostLearner. The optimizer bounds for the learner will be:

var parameters = new double[][]
{
   new double[] { 80, 300 }, // iterations (trees) (min: 80, max: 300)
   new double[] { 0.02, 0.2 }, // learning rate (min: 0.02, max: 0.2)
   new double[] { 8, 15 }, // maximumTreeDepth (min: 8, max: 15)
   new double[] { 0.5, 0.9 }, // subSampleRatio (min: 0.5, max: 0.9)
   new double[] { 1, numberOfFeatures }, // featuresPrSplit (min: 1, max: numberOfFeatures)
}; 
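Again, the objective function constructs a candidate learner from the sampled parameters. A minimal sketch of that part, assuming the constructor parameter names iterations, learningRate, maximumTreeDepth, subSampleRatio and featuresPrSplit:

// Sketch: candidate learner inside the objective function,
// built from the sampled hyperparameters p.
var candidateLearner = new RegressionSquareLossGradientBoostLearner(iterations: (int)p[0],
    learningRate: p[1],
    maximumTreeDepth: (int)p[2],
    subSampleRatio: p[3],
    featuresPrSplit: (int)p[4]);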

Running the optimizer finds the following hyperparameter set:

  • trees: 198
  • learningRate: 0.028
  • maximumTreeDepth: 12
  • subSampleRatio: 0.559
  • featuresPrSplit: 10

Using the found hyperparameters, the RegressionSquareLossGradientBoostLearner is able to reduce the test error further, from 0.3995 to 0.3905:

| Algorithm | Train Error | Test Error |
| --- | --- | --- |
| RegressionSquareLossGradientBoostLearner (optimizer) | 0.0174 | 0.3905 |

Variable Importance

RandomForest and GradientBoost both produce complex models consisting of many decision trees. However, it is still possible to get insights from the models using variable importance. Variable importance describes the relative importance of the individual features used in the model. This provides information about which features in the data set are most important according to the model.

In SharpLearning, most models are able to provide variable importances:

// the variable importance requires the featureNameToIndex
// from the data set. This mapping describes the relation
// from column name to index in the feature matrix.
var featureNameToIndex = parser.EnumerateRows(c => c != targetName)
    .First().ColumnNameToIndex;

// Get the variable importance from the model.
var importances = model.GetVariableImportance(featureNameToIndex);
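The importances can then be listed, highest first; a small usage sketch (using System.Linq), assuming GetVariableImportance returns a name-to-importance dictionary:

// Sketch: print the features ordered by importance.
foreach (var feature in importances.OrderByDescending(kvp => kvp.Value))
{
    Console.WriteLine(feature.Key + ": " + feature.Value);
}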

Below, the variable importances from the last GradientBoost model can be seen:

| Feature Name | Importance |
| --- | --- |
| volatile acidity | 100.00 |
| free sulfur dioxide | 72.19 |
| alcohol | 59.03 |
| residual sugar | 9.81 |
| citric acid | 7.52 |
| density | 5.39 |
| fixed acidity | 5.04 |
| chlorides | 4.77 |
| total sulfur dioxide | 4.39 |
| pH | 2.71 |
| sulphates | 1.94 |

According to the model, "volatile acidity", "free sulfur dioxide" and "alcohol" are the most important features.

Summary

In this introduction, we have created several machine learning models and utilized an optimizer to further tune the hyperparameters for better performance:

| Algorithm | Train Error | Test Error |
| --- | --- | --- |
| RegressionDecisionTreeLearner (default) | 0.0000 | 0.8415 |
| RegressionDecisionTreeLearner (manually tuned) | 0.2681 | 0.6914 |
| RegressionDecisionTreeLearner (optimizer) | 0.4544 | 0.5804 |
| RegressionRandomForestLearner (default) | 0.0518 | 0.4030 |
| RegressionRandomForestLearner (optimizer) | 0.0511 | 0.3995 |
| RegressionSquareLossGradientBoostLearner (optimizer) | 0.0174 | 0.3905 |