index.Rmd

---
title: "Introduction to Statistical Learning in R"
author: '[Francisco Rowe](http://www.franciscorowe.com) ([`@fcorowe`](http://twitter.com/fcorowe))'
date: ''
output:
  tufte::tufte_html:
    css: extra.css
  html_document:
    df_print: paged
  tufte::tufte_book:
    citation_package: natbib
    latex_engine: xelatex
  tufte::tufte_handout:
    citation_package: natbib
    latex_engine: xelatex
  pdf_document: default
link-citations: yes
subtitle: Computational Notebooks
bibliography: skeleton.bib
---

```{r setup, include=FALSE}
library(tufte)
# invalidate cache when the tufte version changes
knitr::opts_chunk$set(tidy = FALSE, cache.extra = packageVersion('tufte'))
options(htmltools.dir.version = FALSE)
```

# Description

This site introduces the course *Introduction to Statistical Learning in R*. The course provides an introduction to statistics and probability covering essential topics in descriptive and inferential statistics and supervised machine learning. It adopts a problem-to-solution teaching approach, defining a practical problem and illustrating how statistics can enable understanding to make critically informed decisions about a population by examining a random sample. It uses a learning-by-doing approach based on real-world examples in various contexts. This also teaches how to conduct statistical data analysis in R. The course is organised around 6 sessions. Each session is designed to provide a combination of key statistical concepts and practical application through the use of R.

The course comprises three main components. The first component focuses on descriptive statistics, including descriptive statistics of different data types, common probability distributions and measures of centrality and dispersion. The second component involves inferential statistics covering hypothesis testing, confidence intervals, correlation, regression analysis, supervised machine learning approaches and cross-validation.

## Learning outcomes
Having successfully completed this course, you will be able to:

1.	Conduct exploratory statistical data analysis.
2.	Have an understanding of elementary probability distributions and data types.
3.	Perform correlation and regression data analysis using real-world data.
4.	Assess the statistical significance between different data types.
5.	Carry out statistical data analysis in R.
6.  Have a basic understanding of supervised machine learning and cross-validation.

# Structure

The notes for each session are:

* **Session 1** [**Introduction to R**](d1n1_intro.html): Data types & probability distributions
* **Session 2** [**Descriptive Statistics**](d1n2_summary.html): Measures of centrality & dispersion for continuous & categorical data
* **Session 3** [**Statistical Significance**](d1n3_confidence.html): Hypothesis testing & confidence intervals

* **Session 4** [**Correlation**](d2n1_correlation.html): Correlation visualisation & measures
* **Session 5** [**Regression Analysis**](d2n2_regression.html): Linear regression, dummy variables & logistic regression
* **Session 6** [**Supervised Machine Learning**](d2n3_machinelearning.html): Tree Regressions, Random Forest & Cross-validation

# Data

For this course, we uses 2011 Census data for England and Wales, and the *Quarterly Labour Force Survey (QLFS)*. QLFS is a quarterly sample survey of households living at private addresses in the United Kingdom. The survey is conducted by the Office for National Statistics. Its purpose is to provide information on the UK labour market. A file labelled ‘qlfs.Rdata’ contains the data. It is a small sub-set of the information collected by the QLFS in the first quarter (January to March) 2012. 

For the purposes of this course, I have cleaned and pruned the original dataset, and saved the resulting file in R format (`.Rdata`). The census data are in aggregate format at the district level and is labelled ‘data_census’. The data and relevant documentation will be available in the [*data* folder](../data).

# Dependencies

This course requires the following libraries installed on your machine:

* **tufte**
* **knitr**
* **kableExtra**
* **tidyverse** which includes ggplot2, dplyr, tidyr, readr, purrr, tibble, stringr,
and forcats
* **lubridate**
* **modelr**
* **broom**
* **jtools** 
* **huxtable**
* **tree**
* **boot**
* **MASS**
* **ISLR**
* **randomForest**
* **rsample**
* **Hmisc**
* **ggcorrplot**
* **RColorBrewer**
* **viridis**
* **coefplot**
* **psych**

Each notebook has slightly different dependencies that cater to the topics covered in each session. To run the notebook for the first session, for example, you will need to install the following libraries: `tufte`, `knitr`, `tidyverse` and `kableExtra` You can install package `mypackage` by running the command `install.packages("mypackage")` on the R prompt or through the `Tools --> Install Packages...` menu in RStudio.

Further reading on `tidyverse`, see [R for Data Science](https://r4ds.had.co.nz) @grolemund_wickham_2019_book.

If you use material from this course, you can give appropriate attribution by using the following citation: @rowe_slr20

You can get the materials from this course as a [download](https://github.com/fcorowe/sl/archive/master.zip) of a .zip file or by going directly to the [GitHub repository](https://github.com/fcorowe/sl).

```{r, include=FALSE}
#This page was last built on:
#date()
```