practical_supervised_machine_learning

Introduction

Resources for a practical on supervised machine learning

Code

The R Markdown (.Rmd) scripts and R scripts constitute the practical.

cross_validation_practical.Rmd shows the basics of cross-validation.

Ch8-baggboost-lab.Rmd shows the basics of decision trees and random forests (bagging and boosting).

If you have time, you can try code in the additional_code folder:

  • caret_rf.Rmd shows the basics of using the caret package in R to build a machine learning pipeline (a minimal sketch of such a pipeline follows this list).
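For a flavour of what a caret pipeline looks like, here is a minimal sketch of training a cross-validated model with caret. The built-in iris data set and the choice of a random forest are illustrative assumptions, not the contents of the actual script.

    # Minimal caret pipeline sketch (illustrative; not the contents of caret_rf.Rmd)
    library(caret)

    # Use a built-in data set purely for illustration
    data(iris)

    # 5-fold cross-validation
    ctrl <- trainControl(method = "cv", number = 5)

    # Train a random forest classifier on Species
    fit <- train(Species ~ ., data = iris, method = "rf", trControl = ctrl)

    print(fit)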

Installation

Clone or download this repository.

Then install R and RStudio.

https://www.rstudio.com/products/rstudio/download/preview/

OR

follow the instructions here:

https://cambiotraining.github.io/intro-r/#Setup_instructions

From the command line, run the R script installer.R to install all required packages:

R --no-save < installer.R

OR

run the script installer.R in RStudio.
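For reference, an installer script of this kind typically just installs the packages the practical needs. The sketch below is hypothetical; the package list in the real installer.R may differ.

    # Hypothetical installer sketch; the actual installer.R may list different packages
    packages <- c("caret", "randomForest", "glmnet", "rpart", "gbm")

    # Install only the packages that are not already present
    missing <- packages[!packages %in% installed.packages()[, "Package"]]
    if (length(missing) > 0) {
      install.packages(missing, repos = "https://cloud.r-project.org")
    }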

Exercise

For an exercise, work through the following problems.

  • Exercise 1: Download data from

https://github.com/neelsoumya/teaching_reproducible_science_R/blob/main/metagene_score.csv

and train a classifier to predict yes/no (the flag_yes_no column); this is a binary classification task. Use cross-validation and write your own R code for this task. You can work in groups. Do this in class. A sketch of one possible approach appears after this exercise.

The data is also available here

https://github.com/neelsoumya/practical_supervised_machine_learning/blob/main/metagene_score.csv
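A minimal sketch of one possible approach is below, using caret with 10-fold cross-validation and logistic regression. The file location and the use of all remaining columns as predictors are assumptions; adapt them to the actual data.

    # Sketch of a cross-validated binary classifier
    # (assumes metagene_score.csv is in the working directory)
    library(caret)

    df <- read.csv("metagene_score.csv")

    # The outcome must be a factor for classification; flag_yes_no comes from the exercise
    df$flag_yes_no <- as.factor(df$flag_yes_no)

    # 10-fold cross-validation
    ctrl <- trainControl(method = "cv", number = 10)

    # Logistic regression using all other columns as predictors (an assumption about the data);
    # caret picks a binomial family automatically for a two-class factor outcome
    fit <- train(flag_yes_no ~ ., data = df, method = "glm", trControl = ctrl)

    print(fit)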

  • Exercise 2: Download data from

https://archive.ics.uci.edu/dataset/2/adult

and train a classifier to predict whether income is >50K or <=50K (a binary classification task). Use cross-validation and write your own R code for this task. You can work in groups. Do this in class. A sketch of one possible approach appears after this exercise.

Challenge: Use cross-validation to select a few important features.

The data is also available in the adult folder here

https://github.com/neelsoumya/practical_supervised_machine_learning/tree/main/adult
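A minimal sketch of one possible approach is below. It assumes the raw comma-separated adult.data file with no header and the standard UCI column names; check the files in the adult folder and adjust the path and names as needed.

    # Sketch for the Adult income task (file name and column names are assumptions)
    library(caret)

    cols <- c("age", "workclass", "fnlwgt", "education", "education_num",
              "marital_status", "occupation", "relationship", "race", "sex",
              "capital_gain", "capital_loss", "hours_per_week", "native_country", "income")

    adult <- read.csv("adult/adult.data", header = FALSE, col.names = cols,
                      strip.white = TRUE, na.strings = "?")
    adult <- na.omit(adult)
    adult$income <- as.factor(adult$income)

    # 10-fold cross-validation with a simple logistic regression
    ctrl <- trainControl(method = "cv", number = 10)
    fit <- train(income ~ ., data = adult, method = "glm", trControl = ctrl)

    print(fit)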

Challenge exercise

How would you select the features that go into the logistic regression model?

Think of a brute-force approach.

Can you think of a more sophisticated approach?

For an advanced exercise, use the glmnet package in R.

See

https://glmnet.stanford.edu/articles/glmnet.html#logistic-regression-family-binomial
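As a sketch of the glmnet route, the code below fits a cross-validated lasso-penalised logistic regression on the Adult data; features with non-zero coefficients at the selected lambda are the ones the penalty keeps. The file name and column names are the same assumptions as in the Exercise 2 sketch.

    # Sketch: penalised logistic regression with glmnet (file and column names are assumptions)
    library(glmnet)

    cols <- c("age", "workclass", "fnlwgt", "education", "education_num",
              "marital_status", "occupation", "relationship", "race", "sex",
              "capital_gain", "capital_loss", "hours_per_week", "native_country", "income")
    adult <- na.omit(read.csv("adult/adult.data", header = FALSE, col.names = cols,
                              strip.white = TRUE, na.strings = "?"))

    # model.matrix expands factors into dummy variables; drop the intercept column
    x <- model.matrix(income ~ ., data = adult)[, -1]
    y <- as.factor(adult$income)

    # Cross-validated lasso (alpha = 1) logistic regression
    cvfit <- cv.glmnet(x, y, family = "binomial", alpha = 1)

    # Coefficients at the cross-validated lambda; zero entries are dropped features
    coef(cvfit, s = "lambda.min")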

Need more of a challenge? See the caret programs in the additional_code folder.

  • Exercise 3: Download the data from

https://github.com/neelsoumya/practical_supervised_machine_learning/blob/main/diabetes.csv

  • Remember to visualize the data and normalize the features.

  • Build a random forest model to predict the diabetes outcome (0/1).

  • Plot the out-of-bag (OOB) error as a function of the number of trees (a sketch of one way to do this follows this list).
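A minimal sketch for this exercise is below, using the randomForest package. The outcome column name Outcome (as in the widely used Pima Indians diabetes CSV) is an assumption, so check the header of diabetes.csv first.

    # Sketch: random forest on diabetes.csv (outcome column name "Outcome" is an assumption)
    library(randomForest)

    diabetes <- read.csv("diabetes.csv")
    diabetes$Outcome <- as.factor(diabetes$Outcome)

    set.seed(42)
    rf_fit <- randomForest(Outcome ~ ., data = diabetes, ntree = 500)

    # err.rate[, "OOB"] holds the out-of-bag error after each additional tree
    plot(rf_fit$err.rate[, "OOB"], type = "l",
         xlab = "Number of trees", ylab = "OOB error")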

Resources

Free PDF of book and R code

More practical tutorials and R code

Acknowledgements

I thank Dr. Bajuna Salehe for useful discussions and feedback.

All material is taken from the following resources:

Contact

Soumya Banerjee

sb2333@cam.ac.uk
