Resources for a practical on supervised machine learning
The R Markdown scripts and R scripts constitute the practical.

cross_validation_practical.Rmd shows the basics of cross-validation.

Ch8-baggboost-lab.Rmd shows the basics of decision trees and random forests (bagging and boosting).
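As a warm-up for the cross-validation material, here is a minimal k-fold cross-validation sketch in base R. The synthetic data frame and the choice of logistic regression are illustrative assumptions, not part of the practical's own scripts.

```r
set.seed(1)
# Synthetic binary-classification data (illustrative only)
df <- data.frame(x1 = rnorm(200), x2 = rnorm(200))
df$y <- rbinom(200, 1, plogis(df$x1 - df$x2))

k <- 5
# Randomly assign each row to one of k folds
folds <- sample(rep(1:k, length.out = nrow(df)))
acc <- numeric(k)
for (i in 1:k) {
  train <- df[folds != i, ]
  test  <- df[folds == i, ]
  fit   <- glm(y ~ x1 + x2, data = train, family = binomial)
  pred  <- ifelse(predict(fit, test, type = "response") > 0.5, 1, 0)
  acc[i] <- mean(pred == test$y)  # held-out accuracy for fold i
}
mean(acc)  # average accuracy across folds
```

Each fold is held out exactly once, so every observation contributes to the test estimate exactly once.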
If you have time, you can try the code in the additional_code folder:

caret_rf.Rmd shows the basics of using the caret package in R to build a machine learning pipeline.
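For orientation before opening caret_rf.Rmd, the core caret pattern is a trainControl object plus a call to train(). This sketch assumes the caret and randomForest packages are installed; the data are synthetic.

```r
library(caret)  # assumes caret (and randomForest) are installed
set.seed(1)
df <- data.frame(x1 = rnorm(200), x2 = rnorm(200))
df$y <- factor(ifelse(df$x1 - df$x2 + rnorm(200) > 0, "yes", "no"))

# 5-fold cross-validation as the resampling scheme
ctrl <- trainControl(method = "cv", number = 5)
fit <- train(y ~ ., data = df, method = "rf",
             trControl = ctrl, tuneLength = 2)
print(fit)  # cross-validated accuracy for each mtry value tried
```

The same trainControl/train pair works for many other models by changing the method argument.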
Clone or download this repository. Then install R and RStudio:

- Install R
- Install RStudio: https://www.rstudio.com/products/rstudio/download/preview/

OR follow the instructions here:

https://cambiotraining.github.io/intro-r/#Setup_instructions

From the command line, run the R script installer.R to install all required packages:

R --no-save < installer.R

OR run installer.R in RStudio.
As exercises, work through the following problems.

- Exercise 1: Download data from
and train a classifier to predict yes/no (flag_yes_no), i.e. a binary classifier. Use cross-validation, and write your own R code for this task. You can work in groups; do this in class.
The data is also available here
- Exercise 2: Download data from
https://archive.ics.uci.edu/dataset/2/adult
and train a classifier to predict whether income is >50K or <=50K (a binary classifier). Use cross-validation, and write your own R code for this task. You can work in groups; do this in class.
Challenge: use cross-validation to select a few important features.
The data is also available in the adult folder here:
https://github.com/neelsoumya/practical_supervised_machine_learning/tree/main/adult
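A hedged starting point for Exercise 2: the column names below follow the UCI adult data description, and the code assumes adult.data has been downloaded to the working directory. The three predictors in the glm call are an arbitrary illustrative choice.

```r
# Column names per the UCI "adult" dataset description
cols <- c("age", "workclass", "fnlwgt", "education", "education_num",
          "marital_status", "occupation", "relationship", "race", "sex",
          "capital_gain", "capital_loss", "hours_per_week",
          "native_country", "income")
if (file.exists("adult.data")) {
  adult <- read.csv("adult.data", header = FALSE, col.names = cols,
                    strip.white = TRUE, na.strings = "?")
  adult$income <- factor(adult$income)  # levels: "<=50K", ">50K"
  # Illustrative logistic regression on three predictors
  fit <- glm(income ~ age + education_num + hours_per_week,
             data = adult, family = binomial)
  print(summary(fit))
} else {
  message("Download adult.data from the UCI repository first.")
}
```

Wrap this model fit in the k-fold loop from the cross-validation practical to complete the exercise.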
Challenge exercise

How would you select the features that go into the logistic regression model? Think of a brute-force approach. Can you think of a more sophisticated approach?
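One brute-force approach is to fit a logistic regression for every subset of candidate features and compare a model-selection criterion such as AIC. This sketch uses synthetic data and three features as an illustrative assumption; with many features the number of subsets grows exponentially, which motivates the more sophisticated approaches below.

```r
set.seed(1)
# Synthetic data: only x1 truly drives the outcome
df <- data.frame(x1 = rnorm(100), x2 = rnorm(100), x3 = rnorm(100))
df$y <- rbinom(100, 1, plogis(df$x1))

features <- c("x1", "x2", "x3")
results <- list()
for (k in seq_along(features)) {
  for (subset in combn(features, k, simplify = FALSE)) {
    f <- reformulate(subset, response = "y")   # e.g. y ~ x1 + x3
    results[[paste(subset, collapse = "+")]] <-
      AIC(glm(f, data = df, family = binomial))
  }
}
sort(unlist(results))  # lowest AIC = preferred feature subset
```

With p features this loop fits 2^p - 1 models (7 here), which is exactly why brute force does not scale.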
For an advanced exercise, use the glmnet package in R. See:
https://glmnet.stanford.edu/articles/glmnet.html#logistic-regression-family-binomial
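A minimal glmnet sketch for penalised logistic regression, assuming the glmnet package is installed; the synthetic matrix and alpha = 1 (lasso) are illustrative choices. cv.glmnet chooses the penalty strength lambda by cross-validation, and the lasso penalty shrinks uninformative coefficients to exactly zero, which performs feature selection.

```r
library(glmnet)  # assumes glmnet is installed
set.seed(1)
x <- matrix(rnorm(200 * 5), ncol = 5)
y <- rbinom(200, 1, plogis(x[, 1] - x[, 2]))

# Lasso-penalised logistic regression; lambda chosen by cross-validation
cvfit <- cv.glmnet(x, y, family = "binomial", alpha = 1)
# Coefficients at the "one standard error" lambda;
# features with zero coefficients have been dropped
coef(cvfit, s = "lambda.1se")
```

Compare the surviving nonzero coefficients with the brute-force subsets: glmnet reaches a similar answer in one fit.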
Need more of a challenge? See the caret programs in the additional_code folder.
- Exercise 3: Download the data from
* Remember to visualize the data and normalize the features.
* Build a random forest model to predict the diabetes outcome (0/1).
* Plot the out-of-bag (OOB) error as a function of the number of trees.
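For the OOB plot in Exercise 3, the randomForest package stores one error estimate per tree in the err.rate matrix. This sketch assumes randomForest is installed and uses synthetic data in place of the diabetes dataset.

```r
library(randomForest)  # assumes randomForest is installed
set.seed(1)
# Synthetic stand-in for the diabetes data (illustrative only)
df <- data.frame(x1 = rnorm(300), x2 = rnorm(300))
df$y <- factor(ifelse(df$x1 + rnorm(300) > 0, "1", "0"))

rf <- randomForest(y ~ ., data = df, ntree = 500)
# err.rate has one row per tree; the "OOB" column is the
# cumulative out-of-bag error after that many trees
plot(rf$err.rate[, "OOB"], type = "l",
     xlab = "Number of trees", ylab = "OOB error")
```

The curve typically drops sharply and then flattens, which indicates how many trees are enough.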
Free PDF of book and R code
More practical tutorials and R code
I thank Dr. Bajuna Salehe for useful discussions and feedback.
All material is taken from the following resources:
Soumya Banerjee