04-eda.Rmd

# Exploratory data analysis and unsupervised learning {#eda}

Exploratory data analysis (EDA) is a process in which we summarise and visually explore a dataset. An important part of EDA is unsupervised learning, which is a collection of methods for finding interesting subgroups and patterns in our data. Unlike statistical hypothesis testing, which is used to reject hypotheses, EDA can be used to _generate_ hypotheses (which can then be confirmed or rejected by new studies). Another purpose of EDA is to find outliers and incorrect observations, which can lead to a cleaner and more useful dataset. In EDA we ask questions about our data and then try to answer them using summary statistics and graphics. Some questions will prove to be important, and some will not. The key to finding important questions is to ask a lot of questions. This chapter will provide you with a wide range of tools for question-asking.

After working with the material in this chapter, you will be able to use R to:

* Create reports using R Markdown,
* Customise the look of your plots,
* Visualise the distribution of a variable,
* Create interactive plots,
* Detect and label outliers,
* Investigate patterns in missing data,
* Visualise trends,
* Plot time series data,
* Visualise multiple variables at once using scatterplot matrices, correlograms and bubble plots,
* Visualise multivariate data using principal component analysis,
* Use unsupervised learning techniques for clustering,
* Use factor analysis to find latent variables in your data.

## Reports with R Markdown {#rmarkdown}
R Markdown files can be used to create nicely formatted documents using R, that are easy to export to other formats, like HTML, Word or pdf. They allow you to mix R code with results and text. They can be used to create reproducible reports that are easy to update with new data, because they include the code for making tables and figures. Additionally, they can be used as notebooks for keeping track of your work and your thoughts as you carry out an analysis. You can even use them for writing books - in fact, this entire book was written using R Markdown.

It is often a good idea to use R Markdown for exploratory analyses, as it allows you to write down your thoughts and comments as the analysis progresses, as well as to save the results of the exploratory journey. For that reason, we'll start this chapter by looking at some examples of what you can do using R Markdown. According to your preference, you can use either R Markdown or ordinary R scripts for the analyses in the remainder of the chapter. The R code used is the same and the results are identical, but if you use R Markdown, you can also save the output of the analysis in a nicely formatted document.

### A first example
When you create a new R Markdown document in RStudio by clicking _File > New File > R Markdown_ in the menu, a document similar to that below is created :

```{r echo = FALSE, comment = ""}
cat(htmltools::includeText("Examples/rmarkdown.Rmd"))
```

Press the _Knit_ button at the top of the Script panel to create an HTML document using this Markdown file. It will be saved in the same folder as your Markdown file. Once the HTML document has been created, it will open so that you can see the results. You may have to install additional packages for this to work, in which case RStudio will prompt you.

Now, let's have a look at what the different parts of the Markdown document do. The first part is called the _document header_ or _YAML header_. It contains information about the document, including its title, the name of its author, and the date on which it was first created:

```{r echo = FALSE, comment = ""}
cat(htmltools::includeText("Examples/rmarkdown1.Rmd"))
```

The part that says `output: html_document` specifies what type of document should be created when you press _Knit_. In this case, it's set to `html_document`, meaning that an HTML document will be created. By changing this to `output: word_document` you can create a `.docx` Word document instead. By changing it to `output: pdf_document`, you can create a `.pdf` document using LaTeX (you'll have to install LaTeX if you haven't already - RStudio will notify you if that is the case). 

The second part sets the default behaviour of _code chunks_ included in the document, specifying that the output from running the chunks should be printed unless otherwise specified:

```{r echo = FALSE, comment = ""}
cat(htmltools::includeText("Examples/rmarkdown2.Rmd"))
```

The third part contains the first proper section of the document. First, a header is created using `##`. Then there is some text with formatting: `< >` is used to create a link and `**` is used to get **bold text**. Finally, there is a code chunk, delimited by ```` ``` ````:

```{r echo = FALSE, comment = ""}
cat(htmltools::includeText("Examples/rmarkdown3.Rmd"))
```

The fourth and final part contains another section, this time with a figure created using R. A setting is added to the code chunk used to create the figure, which prevents the underlying code from being printed in the document:

```{r echo = FALSE, comment = ""}
cat(htmltools::includeText("Examples/rmarkdown4.Rmd"))
```

In the next few sections, we will look at how formatting and code chunks work in R Markdown.

### Formatting text
To create plain text in a Markdown file, you simply have to write plain text. If you wish to add some formatting to your text, you can use the following:

* `_italics_` or `*italics*`: to create text in _italics_.
* `__bold__` or `**bold**`: to create **bold** text.
* `[linked text](http://www.modernstatisticswithr.com)`: to create [linked text](http://www.modernstatisticswithr.com).
* `` `code` ``: to include inline `code` in your document.
* `$a^2 + b^2 = c^2$`: to create inline equations like $a^2 + b^2 = c^2$ using LaTeX syntax.
* `$$a^2 + b^2 = c^2$$`: to create a centred equation on a new line, like $$a^2 + b^2 = c^2.$$

To add headers and subheaders, and to divide your document into section, start a new line with `#`'s as follows:

```
# Header text
## Sub-header text
### Sub-sub-header text
...and so on.
```

### Lists, tables, and images

To create a bullet list, you can use `*` as follows. Note that you need a blank line between your list and the previous paragraph to begin a list.

```{r echo = FALSE, comment = ""}
cat(htmltools::includeText("Examples/rmarkdown5.Rmd"))
```

yielding:

* Item 1
* Item 2
  + Sub-item 1
  + Sub-item 2
* Item 3

To create an ordered list, use:

```{r echo = FALSE, comment = ""}
cat(htmltools::includeText("Examples/rmarkdown6.Rmd"))
```

yielding:

1. First item
2. Second item
  i) Sub-item 1
  ii) Sub-item 2
3. Item 3

To create a table, use `|` and `---------` as follows:

```{r echo = FALSE, comment = ""}
cat(htmltools::includeText("Examples/rmarkdown7.Rmd"))
```

which yields the following output:

Column 1  | Column 2
--------- | ---------
Content  | More content
Even more  | And some here
Even more?  | Yes!

To include an image, use the same syntax as when creating linked text with a link to the image path (either local or on the web), but with a `!` in front:

```{r echo = FALSE, comment = ""}
cat(htmltools::includeText("Examples/rmarkdown8.Rmd"))
```

which yields the following:

![](Rlogo.png)

![Put some text here if you want a caption](Rlogo.png)


### Code chunks
The simplest way to define a code chunk is to write:

```{r echo = FALSE, comment = ""}
cat(htmltools::includeText("Examples/rmarkdown9.Rmd"))
```

In RStudio, Ctrl+Alt+I is a keyboard shortcut for inserting this kind of code chunk.

We can add a name and a caption to the chunk, which lets us reference objects created by the chunk:

```{r echo = FALSE, comment = ""}
cat(htmltools::includeText("Examples/rmarkdown10.Rmd"))
```

This yields the following output:

---

```{r pressureplot, fig.cap = "Plot of the pressure data."}
plot(pressure)
```

As we can see in Figure \@ref(fig:pressureplot), the relationship between
temperature and pressure resembles a banana.

---

In addition, you can add settings to the chunk header to control its behaviour. For instance, you can include a code chunk without running it by adding `echo = FALSE`:

```{r echo = FALSE, comment = ""}
cat(htmltools::includeText("Examples/rmarkdown11.Rmd"))
```

You can add the following settings to your chunks:

* `echo = FALSE` to run the code without printing it,
* `eval = FALSE` to print the code without running it,
* `results = "hide"` to hide printed output,
* `fig.show = "hide"` to hide plots,
* `warning = FALSE` to suppress warning messages from being printed in your document,
* `message = FALSE` to suppress other messages from being printed in your document,
* `include = FALSE` to run a chunk without showing the code or results in the document,
* `error = TRUE` to continue running your R Markdown document even if there is an error in the chunk (by default, the documentation creation stops if there is an error).

Data frames can be printed either as in the Console or as a nicely formatted table. For example,

```{r echo = FALSE, comment = ""}
cat(htmltools::includeText("Examples/rmarkdown12.Rmd"))
```

yields:

```{r, echo = FALSE}
head(airquality)
```

whereas

```{r echo = FALSE, comment = ""}
cat(htmltools::includeText("Examples/rmarkdown13.Rmd"))
```

yields the table below.

```{r, echo = FALSE}
knitr::kable(
    head(airquality), 
    caption = "Some data I found."
)
```

Further help and documentation for R Markdown can be found through the RStudio menus, by clicking  _Help > Cheatsheets > R Markdown Cheat Sheet_ or _Help > Cheatsheets > R Markdown Reference Guide_.

## Customising `ggplot2` plots {#themes}
We'll be using `ggplot2` a lot in this chapter, so before we get started with exploratory analyses, we'll take some time to learn how we can customise the look of `ggplot2`-plots.

Consider the following facetted plot from Section \@ref(comparinggroups):

```{r eval=FALSE}
library(ggplot2)

ggplot(msleep, aes(brainwt, sleep_total)) + 
      geom_point() +
      xlab("Brain weight (logarithmic scale)") +
      ylab("Total sleep time") +
      scale_x_log10() +
      facet_wrap(~ vore)
```

It looks nice, sure, but there may be things that you'd like to change. Maybe you'd like the plot's background to be white instead of grey, or perhaps you'd like to use a different font. These, and many other things, can be modified using _themes_.

### Using themes
`ggplot2` comes with a number of basic themes. All are fairly similar, but differ in things like background colour, grids and borders. You can add them to your plot using `theme_themeName`\index{\texttt{theme\_...}}, where `themeName` is the name of the theme^[See `?theme_grey` for a list of available themes.]. Here are some examples:

```{r eval=FALSE}
p <- ggplot(msleep, aes(brainwt, sleep_total, colour = vore)) + 
      geom_point() +
      xlab("Brain weight (logarithmic scale)") +
      ylab("Total sleep time") +
      scale_x_log10() +
      facet_wrap(~ vore)

# Create plot with different themes:
p + theme_grey() # The default theme
p + theme_bw()
p + theme_linedraw()
p + theme_light()
p + theme_dark()
p + theme_minimal()
p + theme_classic()
```

There are several packages available that contain additional themes. Let's download a few:

```{r eval=FALSE}
install.packages("ggthemes")
library(ggthemes)

# Create plot with different themes from ggthemes:
p + theme_tufte() # Minimalist Tufte theme
p + theme_wsj() # Wall Street Journal
p + theme_solarized() + scale_colour_solarized() # Solarized colours

##############################

install.packages("hrbrthemes")
library(hrbrthemes)

# Create plot with different themes from hrbrthemes:
p + theme_ipsum() # Ipsum theme
p + theme_ft_rc() # Suitable for use with dark RStudio themes
p + theme_modern_rc() # Suitable for use with dark RStudio themes
```

### Colour palettes {#colourpalettes}
Unlike e.g. background colours, the _colour palette_, i.e. the list of colours used for plotting, is not part of the theme that you're using. Next, we'll have a look at how to change the colour palette used for your plot.

Let's start by creating a `ggplot` object:

```{r eval=FALSE}
p <- ggplot(msleep, aes(brainwt, sleep_total, colour = vore)) + 
      geom_point() +
      xlab("Brain weight (logarithmic scale)") +
      ylab("Total sleep time") +
      scale_x_log10()
```

You can change the colour palette using `scale_colour_brewer`\index{\texttt{scale\_colour\_brewer}}\index{plot!choose colour palette}. Three types of colour palettes are available:

* **Sequential palettes**: that range from a colour to white. These are useful for representing ordinal (i.e. ordered) categorical variables and numerical variables.
* **Diverging palettes**: these range from one colour to another, with white in between. Diverging palettes are useful when there is a meaningful middle or 0 value (e.g. when your variables represent temperatures or profit/loss), which can be mapped to white.
* **Qualitative palettes**: which contain multiple distinct colours. These are useful for nominal (i.e. with no natural ordering) categorical variables.

See `?scale_colour_brewer` or http://www.colorbrewer2.org for a list of the available palettes. Here are some examples:

```{r eval=FALSE}
# Sequential palette:
p + scale_colour_brewer(palette = "OrRd")
# Diverging palette:
p + scale_colour_brewer(palette = "RdBu")
# Qualitative palette:
p + scale_colour_brewer(palette = "Set1")
```

In this case, because `vore` is a nominal categorical variable, a qualitative palette is arguably the best choice.

### Theme settings
The point of using a theme is that you get a combination of colours, fonts and other choices that are supposed to go well together, meaning that you don't have to spend too much time picking combinations. But if you like, you can override the default options and customise any and all parts of a theme.

The theme controls all visual aspects of the plot not related to the aesthetics. You can change the theme settings using the `theme` function. For instance, you can use `theme`\index{\texttt{theme}} to remove the legend or change its position:

```{r eval=FALSE}
# No legend:
p + theme(legend.position = "none")

# Legend below figure:
p + theme(legend.position = "bottom")

# Legend inside plot:
p + theme(legend.position = c(0.9, 0.7))
```

In the last example, the vector `c(0.9, 0.7)` gives the relative coordinates of the legend, with `c(0 0)` representing the bottom left corner of the plot and `c(1, 1)` the upper right corner. Try to change the coordinates to different values between `0` and `1` and see what happens.

`theme` has a lot of other settings, including for the colours of the background, the grid and the text in the plot. Here are a few examples that you can use as starting point for experimenting with the settings\index{\texttt{element\_line}}\index{\texttt{element\_rect}}\index{\texttt{element\_text}}\index{\texttt{element\_blank}}:

```{r eval=FALSE}
p + theme(panel.grid.major = element_line(colour = "black"),
          panel.grid.minor = element_line(colour = "purple",
                                          linetype = "dotted"),
          panel.background = element_rect(colour = "red", size = 2),
          plot.background = element_rect(fill = "yellow"),
          axis.text = element_text(family = "mono", colour = "blue"),
          axis.title = element_text(family = "serif", size = 14))
```

To find a complete list of settings, see `?theme`, `?element_line` (lines), `?element_rect` (borders and backgrounds), `?element_text` (text), and `element_blank` (for suppressing plotting of elements). As before, you can use `colors()` to get a list of built-in colours, or use colour hex codes.

$$\sim$$

```{exercise, label="ch4exc0"}
Use the documentation for `theme` and the `element_...` functions to change the plot object `p` created above as follows:

1. Change the background colour of the entire plot to `lightblue`.

2. Change the font of the legend to `serif`.

3. Remove the grid.

4. Change the colour of the axis ticks to `orange` and make them thicker.

```
`r if (knitr::is_html_output()) '[(Click here to go to the solution.)]' else '[]'`(#ch4solutions0)


## Exploring distributions
It is often useful to visualise the distribution of a numerical variable. Comparing the distributions of different groups can lead to important insights. Visualising distributions is also essential when checking assumptions used for various statistical tests (sometimes called _initial data analysis_). In this section we will illustrate how this can be done using the `diamonds` data from the `ggplot2` package, which you started to explore in Chapter \@ref(thebasics).

### Density plots and frequency polygons
We already know how to visualise the distribution of the data by dividing it into bins and plotting a histogram:

```{r eval=FALSE}
library(ggplot2)
ggplot(diamonds, aes(carat)) +
      geom_histogram(colour = "black")
```

A similar plot is created using frequency polygons\index{frequency polygons}, which uses lines instead of bars to display the counts in the bins\index{\texttt{geom\_freqpoly}}:

```{r eval=FALSE}
ggplot(diamonds, aes(carat)) +
      geom_freqpoly()
```

An advantage with frequency polygons is that they can be used to compare groups, e.g. diamonds with different cuts, without facetting:

```{r eval=FALSE}
ggplot(diamonds, aes(carat, colour = cut)) +
      geom_freqpoly()
```

It is clear from this figure that there are more diamonds with ideal cuts than diamonds with fair cuts in the data. The polygons have roughly the same shape, except perhaps for the polygon for diamonds with fair cuts.

In some cases, we are more interested in the shape of the distribution than in the actual counts in the different bins. Density plots\index{density plot} are similar to frequency polygons but show an estimate of the density function of the underlying random variable. These estimates are smooth curves that are scaled so that the area below them is 1 (i.e. scaled to be proper density functions)\index{\texttt{geom\_density}}:

```{r eval=FALSE}
ggplot(diamonds, aes(carat, colour = cut)) +
      geom_density()
```

From this figure, it becomes clear that low-carat diamonds tend to have better cuts, which wasn't obvious from the frequency polygons. However, the plot does not provide any information about _how_ common different cuts are. Use density plots if you're more interested in the shape of a variable's distribution, and frequency polygons if you're more interested in counts.

$$\sim$$

```{exercise, label="ch4exc9"}
Using the density plot created above and the documentation for `geom_density`, do the following:

1. Increase the smoothness of the density curves.

2. Fill the area under the density curves with the same colour as the curves themselves.

3. Make the colours that fill the areas under the curves transparent.

4. The figure still isn't that easy to interpret. Install and load the `ggridges`\index{\texttt{ggridges}} package, an extension of `ggplot2` that allows you to make so-called ridge plots (density plots that are separated along the y-axis, similar to facetting). Read the documentation for `geom_density_ridges`\index{\texttt{geom\_density\_ridges}} and use it to make a ridge plot of diamond prices for different cuts.

```
`r if (knitr::is_html_output()) '[(Click here to go to the solution.)]' else '[]'`(#ch4solutions9)


<br>

```{exercise, label="ch4exc10"}
Return to the histogram created by `ggplot(diamonds, aes(carat)) + geom_histogram()` above. As there are very few diamonds with carat greater than 3, cut the x-axis at 3. Then decrease the bin width to 0.01. Do any interesting patterns emerge?

```
`r if (knitr::is_html_output()) '[(Click here to go to the solution.)]' else '[]'`(#ch4solutions10)


### Asking questions
Exercise \@ref(exr:ch4exc10) causes us to ask why diamonds with carat values that are multiples of 0.25 are more common than others. Perhaps the price is involved? Unfortunately, a plot of `carat` versus `price` is not that informative:

```{r eval=FALSE}
ggplot(diamonds, aes(carat, price)) +
      geom_point(size = 0.05)
```

Maybe we could compute the average price in each bin of the histogram? In that case, we need to extract the bin breaks from the histogram somehow. We could then create a new categorical variable using the breaks with `cut` (as we did in Exercise \@ref(exr:ch3exc3b)). It turns out that extracting the bins is much easier using base graphics\index{\texttt{hist}!extract breaks/midpoints} than `ggplot2`, so let's do that:

```{r eval=FALSE}
# Extract information from a histogram with bin width 0.01,
# which corresponds to 481 breaks:
carat_br <- hist(diamonds$carat, breaks = 481, right = FALSE,
                 plot = FALSE)
# Of interest to us are:
# carat_br$breaks, which contains the breaks for the bins
# carat_br$mid, which contains the midpoints of the bins
#               (useful for plotting!)

# Create categories for each bin:
diamonds$carat_cat <- cut(diamonds$carat, 481, right = FALSE)
```

We now have a variable, `carat_cat`, that shows to which bin each observation belongs. Next, we'd like to compute the mean for each bin. This is a grouped summary - mean by category. After we've computed the bin means, we could then plot them against the bin midpoints. Let's try it:

```{r eval=FALSE}
means <- aggregate(price ~ carat_cat, data = diamonds, FUN = mean)

plot(carat_br$mid, means$price)
```

That didn't work as intended. We get an error message when attempting to plot the results:

```{r eval=FALSE}
Error in xy.coords(x, y, xlabel, ylabel, log) : 
  'x' and 'y' lengths differ
```

The error message implies that the number of bins and the number of mean values that have been computed differ. But we've just computed the mean for each bin, haven't we? So what's going on?

By default, `aggregate` ignores groups for which there are no values when computing grouped summaries. In this case, there are a lot of empty bins - there is for instance no observation in the `[4.99,5)` bin. In fact, only 272 out of the 481 bins are non-empty.

We can solve this in different ways. One way is to remove the empty bins. We can do this using the `match`\index{\texttt{match}} function, which returns the positions of _matching values_ in two vectors. If we use it with the bins from the grouped summary and the vector containing all bins, we can find the indices of the non-empty bins. This requires the use of the `levels` function, that you'll learn more about in Section \@ref(factors):

```{r eval=FALSE}
means <- aggregate(price ~ carat_cat, data = diamonds, FUN = mean)

id <- match(means$carat_cat, levels(diamonds$carat_cat))
```

Finally, we'll also add some vertical lines to our plot, to call attention to multiples of 0.25\index{\texttt{abline}}\index{\texttt{geom\_vline}}.

`r if (knitr::is_html_output()) '<table><div class="column-left">'`\btwocol
Using base graphics is faster here:

```{r eval=FALSE}
plot(carat_br$mid[id], means$price,
     cex = 0.5)

# Add vertical lines at multiples
# of 0.25:
abline(v = c(0.5, 0.75, 1,
             1.25, 1.5))
```
`r if (knitr::is_html_output()) '</div><div class="column-right">'`\columnbreak

But we can of course stick to `ggplot2` if we like:

```{r, eval=F}
library(ggplot2)

d2 <- data.frame(
         bin = carat_br$mid[id],
         mean = means$price)

ggplot(d2, aes(bin, mean)) +
       geom_point() +
       geom_vline(xintercept = 
                  c(0.5, 0.75, 1,
                    1.25, 1.5))
# geom_vline add vertical lines at
# multiples of 0.25
```
`r if (knitr::is_html_output()) '</div></table>'`\etwocol

It appears that there are small jumps in the prices at some of the 0.25-marks. This explains why there are more diamonds just above these marks than just below.

The above example illustrates three crucial things regarding exploratory data analysis:

* Plots (in our case, the histogram) often lead to new questions.

* Often, we must transform, summarise or otherwise manipulate our data to answer a question. Sometimes this is straightforward and sometimes it means diving deep into R code.

* Sometimes the thing that we're trying to do doesn't work straight away. There is almost always a solution though (and oftentimes more than one!). The more you work with R, the more problem-solving tricks you will learn.

### Violin plots
Density curves can also be used as alternatives to boxplots. In Exercise \@ref(exr:ch2exc7), you created boxplots to visualise price differences between diamonds of different cuts:

```{r eval=FALSE}
ggplot(diamonds, aes(cut, price)) +
      geom_boxplot()
```

Instead of using a boxplot, we can use a violin plot. Each group is represented by a "violin"\index{\texttt{geom\_violin}}\index{violin plot}, given by a rotated and duplicated density plot:

```{r eval=FALSE}
ggplot(diamonds, aes(cut, price)) +
      geom_violin()
```

Compared to boxplots, violin plots capture the entire distribution of the data rather than just a few numerical summaries. If you like numerical summaries (and you should!) you can add the median and the quartiles (corresponding to the borders of the box in the boxplot) using the `draw_quantiles` argument:

```{r eval=FALSE}
ggplot(diamonds, aes(cut, price)) +
      geom_violin(draw_quantiles =  c(0.25, 0.5, 0.75))
```

$$\sim$$

```{exercise, label="ch4exc11"}
Using the first boxplot created above, i.e. `ggplot(diamonds, aes(cut, price)) + geom_violin()`, do the following:

1. Add some colour to the plot by giving different colours to each violin.

2. Because the categories are shown along the x-axis, we don't really need the legend. Remove it.

3. Both boxplots and violin plots are useful. Maybe we can have the best of both worlds? Add the corresponding boxplot inside each violin. Hint: the `width` and `alpha` arguments in `geom_boxplot` are useful for creating a nice-looking figure here.\index{violin plot!combine with boxplot}

4. Flip the coordinate system to create horizontal violins and boxes instead.

```
`r if (knitr::is_html_output()) '[(Click here to go to the solution.)]' else '[]'`(#ch4solutions11)


### Combine multiple plots into a single graphic {#patchwork}
When exploring data with many variables, you'll often want to make the same kind of plot (e.g. a violin plot) for several variables. It will frequently make sense to place these side-by-side in the same plot window. The `patchwork`\index{\texttt{patchwork}} package extends `ggplot2` by letting you do just that\index{plot!multiple plots in one graphic}. Let's install it:

```{r eval=FALSE}
install.packages("patchwork")
```

To use it, save each plot as a plot object and then add them together:

```{r eval=FALSE}
plot1 <- ggplot(diamonds, aes(cut, carat, fill = cut)) +
            geom_violin() +
            theme(legend.position = "none")
plot2 <- ggplot(diamonds, aes(cut, price, fill = cut)) +
            geom_violin() +
            theme(legend.position = "none")

library(patchwork)
plot1 + plot2
```

You can also arrange the plots on multiple lines, with different numbers of plots on each line. This is particularly useful if you are combining different types of plots in a single plot window. In this case, you separate plots that are same line by `|` and mark the beginning of a new line with `/`:

```{r eval=FALSE}
# Create two more plot objects:
plot3 <- ggplot(diamonds, aes(cut, depth, fill = cut)) +
            geom_violin() +
            theme(legend.position = "none")
plot4 <- ggplot(diamonds, aes(carat, fill = cut)) +
            geom_density(alpha = 0.5) +
            theme(legend.position = c(0.9, 0.6))

# One row with three plots and one row with a single plot:
(plot1 | plot2 | plot3) / plot4

# One column with three plots and one column with a single plot:
(plot1 / plot2 / plot3) | plot4
```

(You may need to enlarge your plot window for this to look good!)

## Outliers and missing data

### Detecting outliers
Both boxplots and scatterplots are helpful in detecting deviating observations - often called outliers. Outliers can be caused by measurement errors or errors in the data input but can also be interesting rare cases that can provide valuable insights about the process that generated the data. Either way, it is often of interest to detect outliers, for instance because that may influence the choice of what statistical tests to use.

Let's return to the scatterplot of diamond carats versus prices:

```{r eval=FALSE}
ggplot(diamonds, aes(carat, price)) +
      geom_point()
```

There are some outliers which we may want to study further. For instance, there is a surprisingly cheap 5 carat diamond, and some cheap 3 carat diamonds^[Note that it is not just the prices nor just the carats of these diamonds that make them outliers, but the unusual combinations of prices and carats!]. But how can we identify those points?

One option is to use the `plotly`\index{\texttt{plotly}} package to make an interactive version of the plot, where we can hover interesting points to see more information about them. Start by installing it:

```{r eval=FALSE}
install.packages("plotly")
```

To use `plotly` with a `ggplot` graphic, we store the graphic in a variable and then use it as input to the `ggplotly`\index{\texttt{ggplotly}}\index{interactive plot} function. The resulting (interactive!) plot takes a little longer than usual to load. Try hovering the points:

```{r eval=FALSE}
myPlot <- ggplot(diamonds, aes(carat, price)) +
      geom_point()

library(plotly)
ggplotly(myPlot)
```

By default, `plotly` only shows the carat and price of each diamond. But we can add more information to the box by adding a `text` aesthetic\index{\texttt{aes}!\texttt{text}}:

```{r eval=FALSE}
myPlot <- ggplot(diamonds, aes(carat, price, text = paste("Row:",
                                            rownames(diamonds)))) +
                                            geom_point()

ggplotly(myPlot)
```

We can now find the row numbers of the outliers visually, which is very useful when exploring data.

$$\sim$$

```{exercise, label="ch4exc12"}
The variables `x` and `y` in the `diamonds` data describe the length and width of the diamonds (in mm). Use an interactive scatterplot to identify outliers in these variables. Check prices, carat and other information and think about if any of the outliers can be due to data errors.

```
`r if (knitr::is_html_output()) '[(Click here to go to the solution.)]' else '[]'`(#ch4solutions12)


### Labelling outliers
Interactive plots are great when exploring a dataset but are not always possible to use in other contexts, e.g. for printed reports and some presentations. In these other cases, we can instead annotate the plot with notes about outliers. One way to do this is to use a geom called `geom_text`\index{\texttt{geom\_text}}. 

For instance, we may want to add the row numbers of outliers to a plot. To do so, we use `geom_text` along with a condition that specifies for which points we should add annotations. Like in Section \@ref(conditionsintro), if we e.g. wish to add row numbers for diamonds with carats greater than four, our condition would be `carat > 4`. The `ifelse` function, which we'll look closer at in Section \@ref(conditions), is perfect to use here. The syntax\index{\texttt{ifelse}} will be `ifelse(condition, what text to write if the condition is satisfied, what text to write else)`. To add row names for observations that fulfil the condition but not for other observations, we use `ifelse(condition, rownames(diamonds), "")`. If instead, we wanted to print the price of the diamonds, we'd use `ifelse(condition, price, "")`.

Here are some different examples of conditions used to plot text:

```{r eval=FALSE}
## Using the row number (the 5 carat diamond is on row 27,416)
ggplot(diamonds, aes(carat, price)) +
      geom_point() +
      geom_text(aes(label = ifelse(rownames(diamonds) == 27416,
                                   rownames(diamonds), "")),
                hjust = 1.1)
## (hjust=1.1 shifts the text to the left of the point)

## Plot text next to all diamonds with carat>4
ggplot(diamonds, aes(carat, price)) +
     geom_point() +
     geom_text(aes(label = ifelse(carat > 4, rownames(diamonds), "")),
               hjust = 1.1)

## Plot text next to 3 carat diamonds with a price below 7500
ggplot(diamonds, aes(carat, price)) +
      geom_point() +
      geom_text(aes(label = ifelse(carat == 3 & price < 7500,
                                   rownames(diamonds), "")),
                hjust = 1.1)
```

$$\sim$$

```{exercise, label="ch4exc13"}
Create a static (i.e. non-interactive) scatterplot of `x` versus `y` from the `diamonds` data. Label the diamonds with suspiciously high $y$-values.

```
`r if (knitr::is_html_output()) '[(Click here to go to the solution.)]' else '[]'`(#ch4solutions13)


### Missing data
Like many datasets, the mammal sleep data `msleep` contains a lot of missing values, represented by `NA` (Not Available) in R. This becomes evident when we have a look at the data:

```{r eval=FALSE}
library(ggplot2)
View(msleep)
```

We can check if a particular observation is missing using the `is.na` function:

```{r eval=FALSE}
is.na(msleep$sleep_rem[4])
is.na(msleep$sleep_rem)
```

We can count the number of missing values for each variable using\index{\texttt{colSums}}:

```{r eval=FALSE}
colSums(is.na(msleep))
```

Here, `colSums` computes the sum of `is.na(msleep)` for each column of `msleep` (remember that in summation, `TRUE` counts as `1` and `FALSE` as `0`), yielding the number of missing values for each variable. In total, there are 136 missing values in the dataset:

```{r eval=FALSE}
sum(is.na(msleep))
```

You'll notice that `ggplot2` prints a warning in the Console when you create a plot with missing data:

```{r eval=FALSE}
ggplot(msleep, aes(brainwt, sleep_total)) +
      geom_point() +
      scale_x_log10()
```

Sometimes data are missing simply because the information is not yet available (for instance, the brain weight of the mountain beaver could be missing because no one has ever weighed the brain of a mountain beaver). In other cases, data can be missing because something about them is different (for instance, values for a male patient in a medical trial can be missing because the patient died, or because some values only were collected for female patients). Therefore, it is of interest to see if there are any differences in non-missing variables between subjects that have missing data and subjects that don't.

In `msleep`, all animals have recorded values for `sleep_total` and `bodywt`. To check if the animals that have missing `brainwt` values differ from the others, we can plot them in a different colour in a scatterplot:

```{r eval=FALSE}
ggplot(msleep, aes(bodywt, sleep_total, colour = is.na(brainwt))) +
      geom_point() +
      scale_x_log10()
```

(If `is.na(brainwt)` is `TRUE` then the brain weight is missing in the dataset.) In this case, there are no apparent differences between the animals with missing data and those without.

$$\sim$$

```{exercise, label="ch4exc14"}
Create a version of the diamonds dataset where the `x` value is missing for all diamonds with $x>9$. Make a scatterplot of carat versus price, with points where the `x` value is missing plotted in a different colour. How would you interpret this plot?

```
`r if (knitr::is_html_output()) '[(Click here to go to the solution.)]' else '[]'`(#ch4solutions14)


### Exploring data
The `nycflights13`\index{\texttt{nycflights13}}\index{data!\texttt{flights}} package contains data about flights to and from three airports in New York, USA, in 2013. As a summary exercise, we will study a subset of these, namely all flights departing from New York on 1 January that year:

```{r eval=FALSE}
install.packages("nycflights13")
library(nycflights13)
flights2 <- flights[flights$month == 1 & flights$day == 1,]
```

$$\sim$$

```{exercise, label="ch4exc15"}
Explore the `flights2` dataset, focusing on delays and the amount of time spent in the air. Are there any differences between the different carriers? Are there missing data? Are there any outliers?

```
`r if (knitr::is_html_output()) '[(Click here to go to the solution.)]' else '[]'`(#ch4solutions15)


## Trends in scatterplots
Let's return to a familiar example - the relationship between animal brain size and sleep times:

```{r eval=FALSE}
ggplot(msleep, aes(brainwt, sleep_total)) + 
      geom_point() +
      xlab("Brain weight (logarithmic scale)") +
      ylab("Total sleep time") +
      scale_x_log10()
```

There appears to be a decreasing trend in the plot. To aid the eye, we can add a smoothed line by adding a new geom, `geom_smooth`\index{\texttt{geom\_smooth}}\index{trend curve}, to the figure:

```{r eval=FALSE}
ggplot(msleep, aes(brainwt, sleep_total)) + 
      geom_point() +
      geom_smooth() +
      xlab("Brain weight (logarithmic scale)") +
      ylab("Total sleep time") +
      scale_x_log10()
```

This technique is useful for bivariate data as well as for time series, which we'll delve into next.

By default, `geom_smooth` adds a line fitted using either LOESS^[LOESS, LOcally Estimated Scatterplot Smoothing, is a non-parametric regression method that fits a polynomial to local areas of the data.] or GAM^[GAM, Generalised Additive Model, is a generalised linear model where the response variable is a linear function of smooth functions of the predictors.], as well as the corresponding 95 \% confidence interval describing the uncertainty in the estimate. There are several useful arguments that can be used with `geom_smooth`. You will explore some of these in the exercise below.

$$\sim$$

```{exercise, label="ch4exc1"}
Check the documentation for `geom_smooth`. Starting with the plot of brain weight vs. sleep time created above, do the following:

1. Decrease the degree of smoothing for the LOESS line that was fitted to the data. What is better in this case, more or less smoothing?

2. Fit a straight line to the data instead of a non-linear LOESS line.

3. Remove the confidence interval from the plot.

4. Change the colour of the fitted line to red.

```
`r if (knitr::is_html_output()) '[(Click here to go to the solution.)]' else '[]'`(#ch4solutions1)


## Exploring time series {#tsplots}
Before we have a look at time series, you should install four useful packages: `forecast`, `nlme`, `fma` and `fpp2`\index{\texttt{forecast}}\index{\texttt{nlme}}\index{\texttt{fma}}\index{\texttt{fpp2}}. The first contains useful functions for plotting time series data, and the latter three contain datasets that we'll use.

```{r eval=FALSE}
install.packages(c("nlme", "forecast", "fma", "fpp2"),
                 dependencies = TRUE)

```

The `a10`\index{data!\texttt{a10}} dataset contains information about the monthly anti-diabetic drug sales in Australia during the period July 1991 to June 2008. By checking its structure, we see that it is saved as a time series object^[Time series objects are a special class of vectors, with data sampled at equispaced points in time. Each observation can have a year/date/time associated with it.]:

```{r eval=FALSE}
library(fpp2)
str(a10)
```

`ggplot2` requires that data is saved as a data frame in order for it to be plotted. In order to plot the time series, we could first convert it to a data frame and then plot each point using `geom_points`:

```{r eval=FALSE}
a10_df <- data.frame(time = time(a10), sales = a10)
ggplot(a10_df, aes(time, sales)) +
      geom_point()
```

It is however usually preferable to plot time series using lines instead of points. This is done using a different geom: `geom_line`\index{\texttt{geom\_line}}:

```{r eval=FALSE}
ggplot(a10_df, aes(time, sales)) +
      geom_line()
```

Having to convert the time series object to a data frame is a little awkward. Luckily, there is a way around this. `ggplot2` offers a function called `autoplot`\index{\texttt{autoplot}}, that automatically draws an appropriate plot for certain types of data. `forecast` extends this function to time series objects\index{time series!plot}:

```{r eval=FALSE}
library(forecast)
autoplot(a10)
```

We can still add other geoms, axis labels and other things just as before. `autoplot` has simply replaced the `ggplot(data, aes()) + geom` part that would be the first two rows of the `ggplot2` figure, and has implicitly converted the data to a data frame.

$$\sim$$

```{exercise, label="ch4exc2"}
Using the `autoplot(a10)` figure, do the following:

1. Add a smoothed line describing the trend in the data. Make sure that it is smooth enough _not_ to capture the seasonal variation in the data.

2. Change the label of the x-axis to "Year" and the label of the y-axis to "Sales ($ million)".

3. Check the documentation for the `ggtitle` function\index{\texttt{ggtitle}}. What does it do? Use it with the figure.

4. Change the colour of the time series line to red.

```
`r if (knitr::is_html_output()) '[(Click here to go to the solution.)]' else '[]'`(#ch4solutions2)


### Annotations and reference lines
We sometimes wish to add text or symbols to plots, for instance to highlight interesting observations. Consider the following time series plot of daily morning gold prices, based on the `gold`\index{data!\texttt{gold}} data from the `forecast` package:

```{r eval=FALSE}
library(forecast)
autoplot(gold)
```

There is a sharp spike a few weeks before day 800, which is due to an incorrect value in the data series. We'd like to add a note about that to the plot. First, we wish to find out on which day the spike appears. This can be done by checking the data manually or using some code:

```{r eval=FALSE}
spike_date <- which.max(gold)
```

To add a circle around that point, we add a call to `annotate`\index{\texttt{annotate}}\index{annotate plot} to the plot:

```{r eval=FALSE}
autoplot(gold) +
      annotate(geom = "point", x = spike_date, y = gold[spike_date], 
               size = 5, shape = 21,
               colour = "red",
               fill = "transparent")
```

`annotate` can be used to annotate the plot with both geometrical objects and text (and can therefore be used as an alternative to `geom_text`).

$$\sim$$

```{exercise, label="ch4exc3"}
Using the figure created above and the documentation for `annotate`, do the following:

1. Add the text "Incorrect value" next to the circle.

2. Create a second plot where the incorrect value has been removed.

3. Read the documentation for the geom `geom_hline`\index{\texttt{geom\_hline}}. Use it to add a red reference line to the plot, at $y=400$.

```
`r if (knitr::is_html_output()) '[(Click here to go to the solution.)]' else '[]'`(#ch4solutions3)


### Longitudinal data
Multiple time series with identical time points, known as longitudinal data or panel data, are common in many fields. One example of this is given by the `elecdaily` time series from the `fpp2` package\index{data!\texttt{elecdaily}}\index{longitudinal data}\index{panel data}, which contains information about electricity demand in Victoria, Australia during 2014. As with a single time series, we can plot these data using `autoplot`:

```{r eval=FALSE}
library(fpp2)
autoplot(elecdaily)
```

In this case, it is probably a good idea to facet the data, i.e. to plot each series in a different figure:

```{r eval=FALSE}
autoplot(elecdaily, facets = TRUE)
```

$$\sim$$

```{exercise, label="ch4exc4"}
Make the following changes to the `autoplot(elecdaily, facets = TRUE)`:

1. Remove the `WorkDay` variable from the plot (it describes whether or not a given date is a workday, and while it is useful for modelling purposes, we do not wish to include it in our figure).

2. Add smoothed trend lines to the time series plots.

```
`r if (knitr::is_html_output()) '[(Click here to go to the solution.)]' else '[]'`(#ch4solutions4)


### Path plots
Another option for plotting multiple time series is path plots\index{path plot}. A path plot is a scatterplot where the points are connected with lines in the order they appear in the data (which, for time series data, should correspond to time). The lines and points can be coloured to represent time.

To make a path plot of Temperature versus Demand for the `elecdaily` data, we first convert the time series object to a data frame and create a scatterplot:

```{r eval=FALSE}
library(fpp2)
ggplot(as.data.frame(elecdaily), aes(Temperature, Demand)) +
      geom_point()
```

Next, we connect the points by lines using the `geom_path`\index{\texttt{geom\_path}} geom:

```{r eval=FALSE}
ggplot(as.data.frame(elecdaily), aes(Temperature, Demand)) +
      geom_point() +
      geom_path()
```

The resulting figure is quite messy. Using colour to indicate the passing of time helps a little. For this, we need to add the day of the year to the data frame. To get the values right, we use `nrow`\index{\texttt{nrow}}\index{data frame!number of rows}, which gives us the number of rows in the data frame.

```{r eval=FALSE}
elecdaily2 <- as.data.frame(elecdaily)
elecdaily2$day <- 1:nrow(elecdaily2)
      
ggplot(elecdaily2, aes(Temperature, Demand, colour = day)) +
      geom_point() +
      geom_path()
```

It becomes clear from the plot that temperatures were the highest at the beginning of the year and lower in the winter months (July-August).

$$\sim$$

```{exercise, label="ch4exc5"}
Make the following changes to the plot you created above:

1. Decrease the size of the points.

2. Add text annotations showing the dates of the highest and lowest temperatures, next to the corresponding points in the figure.

```
`r if (knitr::is_html_output()) '[(Click here to go to the solution.)]' else '[]'`(#ch4solutions5)


### Spaghetti plots
In cases where we've observed multiple subjects over time, we often wish to visualise their individual time series together using so-called spaghetti plots. With `ggplot2` this is done using the `geom_line` geom.\index{spaghetti plot} To illustrate this, we use the `Oxboys`\index{data!\texttt{Oxboys}} data from the `nlme` package, showing the heights of 26 boys over time.

```{r eval=FALSE}
library(nlme)
ggplot(Oxboys, aes(age, height, group = Subject)) + 
      geom_point() + 
      geom_line()
```

The first two `aes` arguments specify the x and y-axes, and the third specifies that there should be one line per subject (i.e. per boy) rather than a single line interpolating all points. The latter would be a rather useless figure that looks like this:

```{r eval=FALSE}
ggplot(Oxboys, aes(age, height)) + 
      geom_point() + 
      geom_line() +
      ggtitle("A terrible plot")
```

Returning to the original plot, if we wish to be able to identify which time series corresponds to which boy, we can add a colour aesthetic:

```{r eval=FALSE}
ggplot(Oxboys, aes(age, height, group = Subject, colour = Subject)) + 
      geom_point() + 
      geom_line()
```

Note that the boys are ordered by height, rather than subject number, in the legend.

Now, imagine that we wish to add a trend line describing the general growth trend for all boys. The growth appears approximately linear, so it seems sensible to use `geom_smooth(method = "lm")` to add the trend:

```{r eval=FALSE}
ggplot(Oxboys, aes(age, height, group = Subject, colour = Subject)) + 
      geom_point() + 
      geom_line() +
      geom_smooth(method = "lm", colour = "red", se = FALSE)
```

Unfortunately, because we have specified in the aesthetics that the data should be grouped by `Subject`, `geom_smooth` produces one trend line for each boy. The "problem" is that when we specify an aesthetic in the `ggplot` call, it is used for all geoms.

$$\sim$$

```{exercise, label="ch4exc6"}
Figure out how to produce a spaghetti plot of the `Oxboys` data with a single red trend line based on the data from all 26 boys.

```
`r if (knitr::is_html_output()) '[(Click here to go to the solution.)]' else '[]'`(#ch4solutions6)


### Seasonal plots and decompositions {#stlsec}
The `forecast` package includes a number of useful functions when working with time series. One of them is `ggseasonplot`\index{\texttt{ggseasonplot}}\index{seasonal plot}, which allows us to easily create a spaghetti plot of different periods of a time series with seasonality, i.e. with patterns that repeat seasonally over time. It works similar to the `autoplot` function, in that it replaces the `ggplot(data, aes) + geom` part of the code.

```{r eval=FALSE}
library(forecast)
library(fpp2)
ggseasonplot(a10)
```

This function is very useful when visually inspecting seasonal patterns.

The `year.labels` and `year.labels.left` arguments removes the legend in favour of putting the years at the end and beginning of the lines:

```{r eval=FALSE}
ggseasonplot(a10, year.labels = TRUE, year.labels.left = TRUE)
```

As always, we can add more things to our plot if we like:

```{r eval=FALSE}
ggseasonplot(a10, year.labels = TRUE, year.labels.left = TRUE) +
      ylab("Sales ($ million)") +
      ggtitle("Seasonal plot of anti-diabetic drug sales")
```

When working with seasonal time series, it is common to decompose the series into a seasonal component, a trend component and a remainder. In R, this is typically done using the `stl`\index{\texttt{stl}}\index{time series!decomposition} function, which uses repeated LOESS smoothing to decompose the series. There is an `autoplot` function for `stl` objects:

```{r eval=FALSE}
autoplot(stl(a10, s.window = 365))
```

This plot can too be manipulated in the same way as other `ggplot` objects. You can access the different parts of the decomposition as follows:

```{r eval=FALSE}
stl(a10, s.window = 365)$time.series[,"seasonal"]
stl(a10, s.window = 365)$time.series[,"trend"]
stl(a10, s.window = 365)$time.series[,"remainder"]
```

$$\sim$$

```{exercise, label="ch4exc7"}
Investigate the `writing`\index{data!\texttt{writing}} dataset from the `fma` package graphically. Make a time series plot with a smoothed trend line, a seasonal plot and an `stl`-decomposition plot. Add appropriate plot titles and labels to the axes. Can you see any interesting patterns?

```
`r if (knitr::is_html_output()) '[(Click here to go to the solution.)]' else '[]'`(#ch4solutions7)


### Detecting changepoints
The `changepoint`\index{\texttt{changepoint}}\index{time series!changepoint} package contains a number of methods for detecting _changepoints_ in time series, i.e. time points at which either the mean or the variance of the series changes. Finding changepoints can be important for detecting changes in the process underlying the time series. The `ggfortify`\index{\texttt{ggfortify}} package extends `ggplot2` by adding `autoplot` functions for a variety of tools, including those in `changepoint`.\index{\texttt{cpt.mean}}\index{\texttt{cpt.var}}\index{\texttt{cpt.meanvar}} Let's install the packages:

```{r eval=FALSE}
install.packages(c("changepoint", "ggfortify"))
```

We can now look at some examples with the anti-diabetic drug sales data:

```{r eval=FALSE}
library(forecast)
library(fpp2)
library(changepoint)
library(ggfortify)

# Plot the time series:
autoplot(a10)

# Remove the seasonal part and plot the series again:
a10_ns <- a10 - stl(a10, s.window = 365)$time.series[,"seasonal"]
autoplot(a10_ns)

# Plot points where there are changes in the mean:
autoplot(cpt.mean(a10_ns))

# Choosing a different method for finding changepoints
# changes the result:
autoplot(cpt.mean(a10_ns, method = "BinSeg"))

# Plot points where there are changes in the variance:
autoplot(cpt.var(a10_ns))

# Plot points where there are changes in either the mean or
# the variance:
autoplot(cpt.meanvar(a10_ns))
```

As you can see, the different methods from `changepoint` all yield different results. The results for changes in the mean are a bit dubious - which isn't all that strange as we are using a method that looks for jumps in the mean on a time series where the increase actually is more or less continuous. The changepoint for the variance looks more reliable - there is a clear change towards the end of the series where the sales become more volatile. We won't go into details about the different methods here, but mention that the documentation at `?cpt.mean`, `?cpt.var`, and `?cpt.meanvar` contains descriptions of and references for the available methods.

$$\sim$$

```{exercise, label="ch4exc75"}
Are there any changepoints for variance in the `Demand` time series in `elecdaily`? Can you explain why the behaviour of the series changes?

```
`r if (knitr::is_html_output()) '[(Click here to go to the solution.)]' else '[]'`(#ch4solutions75)


### Interactive time series plots {#interactivets}
The `plotly` packages can be used to create interactive time series plots. As before, you create a `ggplot2` object as usual, assigning it to a variable and then call the `ggplotly` function. Here is an example with the `elecdaily` data:

```{r eval=FALSE}
library(plotly)
library(fpp2)
myPlot <- autoplot(elecdaily[,"Demand"])

ggplotly(myPlot)
```

When you hover the mouse pointer over a point, a box appears, displaying information about that data point. Unfortunately, the date formatting isn't great in this example - dates are shown as weeks with decimal points. We'll see how to fix this in Section \@ref(datetime). 


## Using polar coordinates {#polar}
Most plots are made using _Cartesian coordinates systems_, in which the axes are orthogonal to each other and values are placed in an even spacing along each axis. In some cases, nonlinear axes (e.g. log-transformed) are used instead, as we have already seen.

Another option is to use a _polar_ coordinate system, in which positions are specified using an angle and a (radial) distance from the origin. Here is an example of a polar scatterplot:

```{r eval=FALSE}
ggplot(msleep, aes(sleep_rem, sleep_total, colour = vore)) + 
      geom_point() +
      xlab("REM sleep (circular axis)") +
      ylab("Total sleep time (radial axis)") +
      coord_polar()
```

### Visualising periodic data
Polar coordinates are particularly useful when the data is periodic. Consider for instance the following dataset\index{data!\texttt{Cape\_Town\_weather}}, describing monthly weather averages for Cape Town, South Africa:

```{r eval=FALSE}
Cape_Town_weather <- data.frame(
Month = 1:12,
Temp_C = c(22, 23, 21, 18, 16, 13, 13, 13, 14, 16, 18, 20),
Rain_mm = c(20, 20, 30, 50, 70, 90, 100, 70, 50, 40, 20, 20),
Sun_h = c(11, 10, 9, 7, 6, 6, 5, 6, 7, 9, 10, 11))
```

We can visualise the monthly average temperature using lines in a Cartesian coordinate system:

```{r eval=FALSE}
ggplot(Cape_Town_weather, aes(Month, Temp_C)) +
      geom_line()
```

What this plot doesn't show is that the 12th month and the 1st month actually are consecutive months. If we instead use polar coordinates, this becomes clearer:

```{r eval=FALSE}
ggplot(Cape_Town_weather, aes(Month, Temp_C)) +
      geom_line() +
      coord_polar()
```

To improve the presentation, we can change the scale of the x-axis (i.e. the circular axis) so that January and December aren't plotted at the same angle:

```{r eval=FALSE}
ggplot(Cape_Town_weather, aes(Month, Temp_C)) +
      geom_line() +
      coord_polar() +
      xlim(0, 12)
```

$$\sim$$

```{exercise, label="ch4exc75b"}
In the plot that we just created, the last and first month of the year aren't connected. You can fix manually this by adding a cleverly designed faux data point to `Cape_Town_weather`. How?

```
`r if (knitr::is_html_output()) '[(Click here to go to the solution.)]' else '[]'`(#ch4solutions75b)

### Pie charts
Consider the stacked bar chart that we plotted in Section \@ref(catdata2):

```{r eval=FALSE}
ggplot(msleep, aes(factor(1), fill = vore)) +
      geom_bar()
```

What would happen if we plotted this figure in a polar coordinate system instead? If we map the height of the bars (the y-axis of the Cartesian coordinate system) to both the angle and the radial distance, we end up with a pie chart:

```{r eval=FALSE}
ggplot(msleep, aes(factor(1), fill = vore)) +
      geom_bar() +
      coord_polar(theta = "y")
```

There are many arguments against using pie charts for visualisations. Most boil down to the fact that the same information is easier to interpret when conveyed as a bar chart. This is at least partially due to the fact that most people are more used to reading plots in Cartesian coordinates than in polar coordinates.

If we make a similar transformation of a grouped bar chart, we get a different type of pie chart, in which the height of the bars are mapped to both the angle and the radial distance^[Florence Nightingale, who famously pioneered the use of the pie chart, drew her pie charts this way.]:

```{r eval=FALSE}
# Cartestian bar chart:
ggplot(msleep, aes(vore, fill = vore)) +
      geom_bar()

# Polar bar chart:
ggplot(msleep, aes(vore, fill = vore)) +
      geom_bar() +
      coord_polar()
```

## Visualising multiple variables


### Scatterplot matrices {#scatterplotmatrix}
When we have a large enough number of `numeric` variables in our data, plotting scatterplots of all pairs of variables becomes tedious. Luckily there are some R functions that speed up this process.

The `GGally`\index{\texttt{GGally}} package is an extension to `ggplot2` which contains several functions for plotting multivariate data. They work similarly to the `autoplot` functions that we have used in previous sections. One of these is `ggpairs`\index{\texttt{ggpairs}}, which creates a scatterplot matrix\index{scatterplot matrix}, a grid with scatterplots of all pairs of variables in `data`. In addition, it also plots density estimates (along the diagonal) and shows the (Pearson) correlation for each pair. Let's start by installing `GGally`:

```{r eval=FALSE}
install.packages("GGally")
```

To create a scatterplot matrix for the `airquality` dataset, simply write:

```{r eval=FALSE}
library(GGally)
ggpairs(airquality)
```

(Enlarging your Plot window can make the figure look better.)

If we want to create a scatterplot matrix but only want to include some of the variables in a dataset, we can do so by providing a vector with variable names. Here is an example for the animal sleep data `msleep`:

```{r eval=FALSE}
ggpairs(msleep[, c("sleep_total", "sleep_rem", "sleep_cycle", "awake",
                   "brainwt", "bodywt")])
```

Optionally, if we wish to create a scatterplot involving all `numeric` variables, we can replace the vector with variable names with some R code that extracts the columns containing `numeric` variables\index{data frame!select `numeric` variables}:

```{r eval=FALSE}
ggpairs(msleep[, which(sapply(msleep, class) == "numeric")])
```

(You'll learn more about the `sapply` function in Section \@ref(vectorloops).)

The resulting plot is identical to the previous one, because the list of names contained all `numeric` variables. The grab-all-numeric-variables approach is often convenient, because we don't have to write all the variable names. On the other hand, it's not very helpful in case we only want to include some of the `numeric` variables.

If we include a categorical variable in the list of variables (such as the feeding behaviour `vore`), the matrix will include a bar plot of the categorical variable as well as boxplots and facetted histograms to show differences between different categories in the continuous variables:

```{r eval=FALSE}
ggpairs(msleep[, c("vore", "sleep_total", "sleep_rem", "sleep_cycle",
                   "awake", "brainwt", "bodywt")])
```

Alternatively, we can use a categorical variable to colour points and density estimates using `aes(colour = ...)`. The syntax for this is follows the same pattern as that for a standard `ggplot` call - `ggpairs(data, aes)`. The only exception is that if the categorical variable is not included in the `data` argument, we much specify which data frame it belongs to:

```{r eval=FALSE}
ggpairs(msleep[, c("sleep_total", "sleep_rem", "sleep_cycle", "awake",
                   "brainwt", "bodywt")],
        aes(colour = msleep$vore, alpha = 0.5))
```

As a side note, if all variables in your data frame are `numeric`, and if you only are looking for a quick-and-dirty scatterplot matrix without density estimates and correlations, you can also use the base R `plot`:

```{r eval=FALSE}
plot(airquality)
```

$$\sim$$

```{exercise, label="ch4exc16"}
Create a scatterplot matrix for all `numeric` variables in `diamonds`. Differentiate different cuts by colour. Add a suitable title to the plot. (`diamonds` is a fairly large dataset, and it may take a minute or so for R to create the plot.)

```
`r if (knitr::is_html_output()) '[(Click here to go to the solution.)]' else '[]'`(#ch4solutions16)


### 3D scatterplots
The `plotly` package lets us make three-dimensional scatterplots with the `plot_ly` function\index{\texttt{plot\_ly}}, which can be a useful alternative to scatterplot matrices in some cases. Here is an example using the `airquality` data:

```{r eval=FALSE}
library(plotly)
plot_ly(airquality, x = ~Ozone, y = ~Wind, z = ~Temp,
        color = ~factor(Month))
```

Note that you can drag and rotate the plot, to see it from different angles.

### Correlograms
Scatterplot matrices are not a good choice when we have too many variables, partially because the plot window needs to be very large to fit all variables and partially because it becomes difficult to get a good overview of the data. In such cases, a correlogram, where the strength of the correlation between each pair of variables is plotted instead of scatterplots, can be used instead. It is effectively a visualisation of the correlation matrix of the data, where the strengths and signs of the correlations are represented by different colours.

The `GGally` package contains the function `ggcorr`\index{\texttt{ggcorr}}\index{correlogram}, which can be used to create a correlogram:

```{r eval=FALSE}
ggcorr(msleep[, c("sleep_total", "sleep_rem", "sleep_cycle", "awake",
                   "brainwt", "bodywt")])
```

$$\sim$$

```{exercise, label="ch4exc17"}
Using the `diamonds` dataset and the documentation for `ggcorr`, do the following:

1. Create a correlogram for all `numeric` variables in the dataset.

2. The Pearson correlation that `ggcorr` uses by default isn't always the best choice. A commonly used alternative is the Spearman correlation\index{correlation!Spearman}. Change the type of correlation used to create the plot to the Spearman correlation.

3. Change the colour scale from a categorical scale with 5 categories.

4. Change the colours on the scale to go from yellow (low correlation) to black (high correlation).

```
`r if (knitr::is_html_output()) '[(Click here to go to the solution.)]' else '[]'`(#ch4solutions17)


### Adding more variables to scatterplots
We have already seen how scatterplots can be used to visualise two continuous and one categorical variable by plotting the two continuous variables against each other and using the categorical variable to set the colours of the points. There are however more ways we can incorporate information about additional variables into a scatterplot.

So far, we have set three aesthetics in our scatterplots: `x`, `y` and `colour`. Two other important aesthetics are `shape`\index{\texttt{aes}!\texttt{shape}} and `size`\index{\texttt{aes}!\texttt{size}}, which, as you'd expect, allow us to control the shape and size of the points. As a first example using the `msleep` data, we use feeding behaviour (`vore`) to set the shapes used for the points:

```{r eval=FALSE}
ggplot(msleep, aes(brainwt, sleep_total, shape = vore)) +
      geom_point() +
      scale_x_log10()
```

The plot looks a little nicer if we increase the `size` of the points:

```{r eval=FALSE}
ggplot(msleep, aes(brainwt, sleep_total, shape = vore, size = 2)) +
      geom_point() +
      scale_x_log10()
```

Another option is to let `size` represent a continuous variable, in what is known as a bubble plot\index{bubble plot}:

```{r eval=FALSE}
ggplot(msleep, aes(brainwt, sleep_total, colour = vore,
                   size = bodywt)) +
      geom_point() +
      scale_x_log10()
```

The size of each "bubble" now represents the weight of the animal. Because some animals are much heavier (i.e. have higher `bodywt` values) than most others, almost all points are quite small. There are a couple of things we can do to remedy this. First, we can transform `bodywt`, e.g. using the square root transformation `sqrt(bodywt)`, to decrease the differences between large and small animals. This can be done by adding `scale_size(trans = "sqrt")` to the plot. Second, we can also use `scale_size` to control the range of point sizes (e.g. from size 1 to size 20). This will cause some points to overlap, so we add `alpha = 0.5` to the geom, to make the points transparent:

```{r eval=FALSE}
ggplot(msleep, aes(brainwt, sleep_total, colour = vore,
                   size = bodywt)) +
      geom_point(alpha = 0.5) +
      scale_x_log10() +
      scale_size(range = c(1, 20), trans = "sqrt")
```

This produces a fairly nice-looking plot, but it'd look even better if we changed the axes labels and legend texts. We can change the legend text for the size scale by adding the argument `name` to `scale_size`\index{\texttt{scale\_size}}. Including a `\n` in the text lets us create a line break - you'll learn more tricks like that in Section \@ref(strings). Similarly, we can use `scale_colour_discrete`\index{\texttt{scale\_colour\_discrete}} to change the legend text for the colours:

```{r eval=FALSE}
ggplot(msleep, aes(brainwt, sleep_total, colour = vore,
                   size = bodywt)) +
      geom_point(alpha = 0.5) +
      xlab("log(Brain weight)") +
      ylab("Sleep total (h)") +
      scale_x_log10() +
      scale_size(range = c(1, 20), trans = "sqrt",
                 name = "Square root of\nbody weight") +
      scale_colour_discrete(name = "Feeding behaviour")
```

$$\sim$$

```{exercise, label="ch4exc18"}
Using the bubble plot created above, do the following:

1. Replace `colour = vore` in the `aes` by `fill = vore` and add `colour = "black", shape = 21` to `geom_point`. What happens?

2. Use `ggplotly` to create an interactive version of the bubble plot above, where variable information and the animal name are displayed when you hover a point.

```
`r if (knitr::is_html_output()) '[(Click here to go to the solution.)]' else '[]'`(#ch4solutions18)


### Overplotting
Let's make a scatterplot of `table` versus `depth` based on the `diamonds` dataset:

```{r eval=FALSE}
ggplot(diamonds, aes(table, depth)) +
      geom_point()
```

This plot is cluttered. There are too many points, which makes it difficult to see if, for instance, high `table` values are more common than low `table` values. In this section, we'll look at some ways to deal with this problem, known as overplotting\index{overplotting}.

The first thing we can try is to decrease the point size:

```{r eval=FALSE}
ggplot(diamonds, aes(table, depth)) +
      geom_point(size = 0.1)
```

This helps a little, but now the outliers become a bit difficult to spot. We can try changing the opacity using `alpha` instead:

```{r eval=FALSE}
ggplot(diamonds, aes(table, depth)) +
      geom_point(alpha = 0.2)
```

This is also better than the original plot, but neither plot is great. Instead of plotting each individual point, maybe we can try plotting the counts or densities in different regions of the plot instead? Effectively, this would be a 2D version of a histogram. There are several ways of doing this in `ggplot2`.

First, we bin the points and count the numbers in each bin, using `geom_bin2d`\index{\texttt{geom\_bind2d}}:

```{r eval=FALSE}
ggplot(diamonds, aes(table, depth)) +
      geom_bin2d()
```

By default, `geom_bin2d` uses 30 bins. Increasing that number can sometimes give us a better idea about the distribution of the data:

```{r eval=FALSE}
ggplot(diamonds, aes(table, depth)) +
      geom_bin2d(bins = 50)
```

If you prefer, you can get a similar plot with hexagonal bins by using `geom_hex`\index{\texttt{geom\_hex}} instead:

```{r eval=FALSE}
ggplot(diamonds, aes(table, depth)) +
      geom_hex(bins = 50)
```

As an alternative to bin counts, we could create a 2-dimensional density estimate\index{\texttt{stat\_density\_2d}}\index{density plot!2-dimensional} and create a contour plot showing the levels of the density:

```{r eval=FALSE}
ggplot(diamonds, aes(table, depth)) +
      stat_density_2d(aes(fill = ..level..), geom = "polygon",
                      colour = "white")
```

The `fill = ..level..` bit above probably looks a little strange to you. It means that an internal function (the level of the contours) is used to choose the fill colours. It also means that we've reached a point where we're reaching deep into the depths of `ggplot2`!

We can use a similar approach to show a summary statistic for a third variable in a plot. For instance, we may want to plot the average price as a function of `table` and `depth`. This is called a tile plot\index{tile plot}\index{\texttt{geom\_tile}}:

```{r eval=FALSE}
ggplot(diamonds, aes(table, depth, z = price)) +
     geom_tile(binwidth = 1, stat = "summary_2d", fun = mean) +
     ggtitle("Mean prices for diamonds with different depths and
             tables")
```

$$\sim$$

```{exercise, label="ch4exc19"}
The following tasks involve the `diamonds` dataset:

1. Create a tile plot of `table` versus `depth`, showing the highest price for a diamond in each bin.

2. Create a bin plot of `carat` versus `price`. What type of diamonds have the highest bin counts?

```
`r if (knitr::is_html_output()) '[(Click here to go to the solution.)]' else '[]'`(#ch4solutions19)


### Categorical data
When visualising a pair of categorical variables, plots similar to those in the previous section prove to be useful. One way of doing this is to use the `geom_count`\index{\texttt{geom\_count}} geom. We illustrate this with an example using `diamonds`, showing how common different combinations of colours and cuts are:

```{r eval=FALSE}
ggplot(diamonds, aes(color, cut)) +
      geom_count()
```

However, it is often better to use colour rather than point size to visualise counts, which we can do using a tile plot. First, we have to compute the counts though\index{count occurences}, using `aggregate`. We now wish to have two grouping variables, `color` and `cut`, which we can put on the right-hand side of the formula as follows:

```{r eval=FALSE}
diamonds2 <- aggregate(carat ~ cut + color, data = diamonds,
                       FUN = length)
diamonds2
```

`diamonds2` is now a data frame containing the different combinations of `color` and `cut` along with counts of how many diamonds belong to each combination (labelled `carat`, because we put `carat` in our formula). Let's change the name of the last column from `carat` to `Count`:

```{r eval=FALSE}
names(diamonds2)[3] <- "Count"
```

Next, we can plot the counts using `geom_tile`:

```{r eval=FALSE}
ggplot(diamonds2, aes(color, cut, fill = Count)) +
      geom_tile()
```

It is also possible to combine point size and colours:

```{r eval=FALSE}
ggplot(diamonds2, aes(color, cut, colour = Count, size = Count)) +
      geom_count()
```

$$\sim$$

```{exercise, label="ch4exc20"}
Using the `diamonds` dataset, do the following:

1. Use a plot to find out what the most common combination of cut and clarity is.

2. Use a plot to find out which combination of cut and clarity has the highest average price.

```
`r if (knitr::is_html_output()) '[(Click here to go to the solution.)]' else '[]'`(#ch4solutions20)


### Putting it all together
In the next two exercises, you will repeat what you have learned so far by investigating the `gapminder`\index{data!\texttt{gapminder}} and `planes`\index{data!\texttt{planes}} datasets. First, load the corresponding libraries and have a look at the documentation for each dataset:

```{r eval=FALSE}
install.packages("gapminder")
library(gapminder)
?gapminder

library(nycflights13)
?planes
```

$$\sim$$

```{exercise, label="ch4exc21"}
Do the following using the `gapminder` dataset:

1. Create a scatterplot matrix showing life expectancy, population and GDP per capita for all countries, using the data from the year 2007. Use colours to differentiate countries from different continents. Note: you'll probably need to add the argument `upper = list(continuous = "na")` when creating the scatterplot matrix. By default, correlations are shown above the diagonal, but the fact that there only are two countries from Oceania will cause a problem there - at least 3 points are needed for a correlation test.

2. Create an interactive bubble plot, showing information about each country when you hover the points. Use data from the year 2007. Put log(GDP per capita) on the x-axis and life expectancy on the y-axis. Let population determine point size. Plot each country in a different colour and facet by continent.
Tip: the `gapminder` package provides a pretty colour scheme for different countries, called `country_colors`. You can use that scheme by adding `scale_colour_manual(values = country_colors)` to your plot.

```
`r if (knitr::is_html_output()) '[(Click here to go to the solution.)]' else '[]'`(#ch4solutions21)


<br>

```{exercise, label="ch4exc22"}
Use graphics to answer the following questions regarding the `planes` dataset:

1. What is the most common combination of manufacturer and plane type in the dataset?

2. Which combination of manufacturer and plane type has the highest average number of seats?

3. Do the numbers of seats on planes change over time? Which plane had the highest number of seats?

4. Does the type of engine used change over time?

```
`r if (knitr::is_html_output()) '[(Click here to go to the solution.)]' else '[]'`(#ch4solutions22)


## Principal component analysis {#pca}
If there are many variables in your data, it can often be difficult to detect differences between groups or create a perspicuous visualisation. A useful tool in this context is _principal component analysis_ (PCA, for short)\index{principal component analysis}, which can reduce high-dimensional data to a lower number of variables that can be visualised in one or two scatterplots. The idea is to compute new variables, called _principal components_, that are linear combinations of the original variables^[A linear combination is a weighted sum of the form $a_1x_1+a_2x_2+\cdots+a_px_p$. If you like, you can think of principal components as weighted averages of variables, computed for each row in your data.]. These are constructed with two goals in mind: the principal components should capture as much of the variance in the data as possible, and each principal component should be uncorrelated to the other components. You can then plot the principal components to get a low-dimensional representation of your data, which hopefully captures most of its variation. 

By design, the number of principal components computed are as many as the original number of variables, with the first having the largest variance, the second having the second largest variance, and so on. We hope that it will suffice to use just the first few of these to represent most of the variation in the data, but this is not guaranteed. Principal component analysis is more likely to yield a useful result if several variables are correlated.

To illustrate the principles of PCA we will use a dataset from Charytanowicz et al. (2010), containing measurements on wheat kernels for three varieties of wheat. A description of the variables is available at

http://archive.ics.uci.edu/ml/datasets/seeds

We are interested to find out if these measurements can be used to distinguish between the varieties. The data is stored in a `.txt` file, which we import using `read.table`\index{\texttt{read.table}}\index{data!\texttt{seeds}} (which works just like `read.csv`, but is tailored to text files) and convert the `Variety` column to a categorical `factor` variable (which you'll learn more about in Section \@ref(factors)):

```{r eval=FALSE}
# The data is downloaded from the UCI Machine Learning Repository:
# http://archive.ics.uci.edu/ml/datasets/seeds
seeds <- read.table("https://tinyurl.com/seedsdata",
        col.names = c("Area", "Perimeter", "Compactness",
         "Kernel_length", "Kernel_width", "Asymmetry",
         "Groove_length", "Variety"))
seeds$Variety <- factor(seeds$Variety)
```

If we make a scatterplot matrix of all variables, it becomes evident that there are differences between the varieties, but that no single pair of variables is enough to separate them:

```{r eval=FALSE}
library(ggplot2)
library(GGally)
ggpairs(seeds[, -8], aes(colour = seeds$Variety, alpha = 0.2))
```

Moreover, for presentation purposes, the amount of information in the scatterplot matrix is a bit overwhelming. It would be nice to be able to present the data in a single scatterplot, without losing too much information. We'll therefore compute the principal components using the `prcomp` function\index{\texttt{prcomp}}. It is usually recommended that PCA is performed using standardised data, i.e. using data that has been scaled to have mean 0 and standard deviation 1. The reason for this is that it puts all variables on the same scale. If we don't standardise our data then variables with a high variance will completely dominate the principal components. This isn't desirable, as variance is affected by the scale of the measurements, meaning that the choice of measurement scale would influence the results (as an example, the variance of kernel length will be a million times greater if lengths are measured in millimetres instead of in metres).

We don't have to standardise the data ourselves, but can let `prcomp` do that for us using the arguments `center = TRUE` (to get mean 0) and `scale. = TRUE` (to get standard deviation 1):

```{r eval=FALSE}
# Compute principal components:
pca <- prcomp(seeds[,-8], center = TRUE, scale. = TRUE)
```

To see the _loadings_ of the components, i.e. how much each variable contributes to the components, simply type the name of the object `prcomp` created:

```{r eval=FALSE}
pca
```

The first principal component is more or less a weighted average of all variables, but has stronger weights on `Area`, `Perimeter`, `Kernel_length`, `Kernel_width`, and `Groove_length`, all of which are measures of size. We can therefore interpret it as a size variable. The second component has higher loadings for `Compactness` and `Asymmetry`, meaning that it mainly measures those shape features. In Exercise \@ref(exr:ch4exc24) you'll see how the loadings can be visualised in a _biplot_.

To see how much of the variance each component represents, use `summary`\index{\texttt{summary}}:

```{r eval=FALSE}
summary(pca)
```

The first principal component accounts for 71.87 % of the variance, and the first three combined account for 98.67 %.

To visualise this, we can draw a _scree plot_, which shows the variance of each principal component - the total variance of the data is the sum of the variances of the principal components:

```{r eval=FALSE}
screeplot(pca, type = "lines")
```

We can use this to choose how many principal components to use when visualising or summarising our data. In that case, we look for an "elbow", i.e. a bend after which increases the number of components doesn't increase the amount of variance explained much.

We can access the values of the principal components using `pca$x`. Let's check that the first two components really are uncorrelated:

```{r eval=FALSE}
cor(pca$x[,1], pca$x[,2])
```

In this case, almost all of the variance is summarised by the first two or three principal components. It appears that we have successfully reduced the data from 7 variables to 2-3, which should make visualisation much easier. The `ggfortify` package contains an `autoplot` function for PCA objects, that creates a scatterplot of the first two principal components:

```{r eval=FALSE}
library(ggfortify)
autoplot(pca, data = seeds, colour = "Variety")
```

That is much better! The groups are almost completely separated, which shows that the variables can be used to discriminate between the three varieties. The first principal component accounts for 71.87 % of the total variance in the data, and the second for 17.11 %. 

If you like, you can plot other pairs of principal components than just components 1 and 2. In this case, component 3 may be of interest, as its variance is almost as high as that of component 2. You can specify which components to plot with the `x` and `y` arguments:

```{r eval=FALSE}
# Plot 2nd and 3rd PC:
autoplot(pca, data = seeds, colour = "Variety",
         x = 2, y = 3)
```

Here, the separation is nowhere near as clear as in the previous figure. In this particular example, plotting the first two principal components is the better choice.

Judging from these plots, it appears that the kernel measurements can be used to discriminate between the three varieties of wheat. In Chapters \@ref(modchapter) and \@ref(mlchapter) you'll learn how to use R to build models that can be used to do just that, e.g. by predicting which variety of wheat a kernel comes from given its measurements. If we wanted to build a statistical model that could be used for this purpose, we could use the original measurements. But we could also try using the first two principal components as the only input to the model. Principal component analysis is very useful as a pre-processing tool, used to create simpler models based on fewer variables (or ostensibly simpler, because the new variables are typically more difficult to interpret than the original ones).

$$\sim$$

```{exercise, label="ch4exc23"}
Use principal components on the `carat`, `x`, `y`, `z`, `depth`, and `table` variables in the `diamonds` data, and answer the following questions:
  
1. How much of the total variance does the first principal component account for? How many components are needed to account for at least 90 % of the total variance?
  
2. Judging by the loadings, what do the first two principal components measure?

3. What is the correlation between the first principal component and `price`?
  
4. Can the first two principal components be used to distinguish between diamonds with different cuts?

```
`r if (knitr::is_html_output()) '[(Click here to go to the solution.)]' else '[]'`(#ch4solutions23)


<br>

```{exercise, label="ch4exc24"}
Return to the scatterplot of the first two principal components for the `seeds` data, created above. Adding the arguments `loadings = TRUE` and `loadings.label = TRUE` to the `autoplot` call creates a _biplot_\index{\texttt{biplot}}, which shows the loadings for the principal components on top of the scatterplot. Create a biplot and compare the result to those obtained by looking at the loadings numerically. Do the conclusions from the two approaches agree?

```
`r if (knitr::is_html_output()) '[(Click here to go to the solution.)]' else '[]'`(#ch4solutions24)


## Cluster analysis {#clustering}
Cluster analysis is concerned with grouping observations into groups, _clusters_\index{clustering}, that in some sense are similar. Numerous methods are available for this task, approaching the problem from different angles. Many of these are available in the `cluster`\index{\texttt{cluster}} package, which ships with R. In this section, we'll look at a smorgasbord of clustering techniques.

### Hierarchical clustering
As a first example where clustering can be of interest\index{clustering!hierarchical}, we'll consider the `votes.repub`\index{data!\texttt{votes.repub}} data from `cluster`. It describes the proportion of votes for the Republican candidate in US presidential elections from 1856 to 1976 in 50 different states:

```{r eval=FALSE}
library(cluster)
?votes.repub
View(votes.repub)
```

We are interested in finding subgroups - clusters - of states with similar voting patterns.

To find clusters of similar observations (states, in this case), we could start by assigning each observation to its own cluster. We'd then start with 50 clusters, one for each observation. Next, we could merge the two clusters that are the most similar, yielding 49 clusters, one of which consisted of two observations and 48 consisting of a single observation. We could repeat this process, merging the two most similar clusters in each iteration until only a single cluster was left. This would give us a _hierarchy_ of clusters, which could be plotted in a tree-like structure, where observations from the same cluster would be one the same branch. Like this:

```{r eval=FALSE}
clusters_agnes <- agnes(votes.repub)
plot(clusters_agnes, which = 2)
```

This type of plot is known as a _dendrogram_\index{clustering!dendrogram}.

We've just used `agnes`\index{\texttt{agnes}}, a function from `cluster` that can be used to carry out _hierarchical clustering_ in the manner described above. There are a couple of things that need to be clarified, though.

First, how do we measure how similar two $p$-dimensional observations $x$ and $y$ are? `agnes` provides two measures of distance between points\index{clustering!distance measure}:

* `metric = "euclidean"` (the default), uses the Euclidean $L_2$ distance $||x-y||=\sqrt{\sum_{i=1}^p(x_i-y_i)^2}$,
* `metric = "manhattan"`, uses the Manhattan $L_1$ distance $||x-y||={\sum_{i=1}^p|x_i-y_i|}$.

Note that neither of these will work create if you have categorical variables in your data. If all your variables are binary, i.e. categorical with two values, you can use `mona`\index{\texttt{mona}} instead of `agnes` for hierarchical clustering.

Second, how do we measure how similar two clusters of observations are? `agnes` offers a number of options here. Among them are\index{clustering!linkage}:

* `method = "average` (the default), unweighted average linkage, uses the average distance between points from the two clusters,
* `method = "single`, single linkage, uses the smallest distance between points from the two clusters,
* `method = "complete`, complete linkage, uses the largest distance between points from the two clusters,
* `method = "ward"`, Ward's method, uses the within-cluster variance to compare different possible clusterings, with the clustering with the lowest within-cluster variance being chosen.

Regardless of which of these that you use, it is often a good idea to standardise the numeric variables in your dataset so that they all have the same variance. If you don't, your distance measure is likely to be dominated by variables with larger variance, while variables with low variances will have little or no impact on the clustering. To standardise your data, you can use `scale`\index{\texttt{scale}}:

```{r eval=FALSE}
# Perform clustering on standardised data:
clusters_agnes <- agnes(scale(votes.repub))
# Plot dendrogram:
plot(clusters_agnes, which = 2)
```

At this point, we're starting to use several functions after another, and so this looks like a perfect job for a pipeline. To carry out the same analysis uses `%>%` pipes, we write:

```{r eval=FALSE}
library(magrittr)
votes.repub %>% scale() %>%
                agnes() %>% 
                plot(which = 2)
```

We can now try changing the metric and clustering method used as described above. Let's use the Manhattan distance and complete linkage:

```{r eval=FALSE}
votes.repub %>% scale() %>%
                agnes(metric = "manhattan", method = "complete") %>% 
                plot(which = 2)
```

We can change the look of the dendrogram by adding `hang = -1`, which causes all observations to be placed at the same level:

```{r eval=FALSE}
votes.repub %>% scale() %>%
                agnes(metric = "manhattan", method = "complete") %>% 
                plot(which = 2, hang = -1)
```

As an alternative to `agnes`, we can consider `diana`\index{\texttt{diana}}. `agnes` is an _agglomerative_ method, which starts with a lot of clusters and then merge them step-by-step. `diana`, in contrast, is a _divisive_ method, which starts with one large cluster and then step-by-step splits it into several smaller clusters.

```{r eval=FALSE}
votes.repub %>% scale() %>%
                diana() %>% 
                plot(which = 2)
```

You can change the distance measure used by setting `metric` in the `diana` call. Euclidean distance is the default.

To wrap this section up, we'll look at two packages that are useful for plotting the results of hierarchical clustering: `dendextend` and `factoextra`\index{\texttt{dendextend}}. We installed `factoextra` in the previous section, but still need to install `dendextend`:

```{r eval=FALSE}
install.packages("dendextend")
```

To compare the dendrograms from produced by different methods (or the same method with different settings), in a _tanglegram_, where the dendrograms are plotted against each other, we can use `tanglegram`\index{\texttt{tanglegram}}\index{\texttt{as.dendrogram}} from `dendextend`:

```{r eval=FALSE}
library(dendextend)
# Create clusters using agnes:
votes.repub %>% scale() %>%
                agnes() -> clusters_agnes
# Create clusters using diana:
votes.repub %>% scale() %>%
                diana() -> clusters_diana

# Compare the results:
tanglegram(as.dendrogram(clusters_agnes), 
           as.dendrogram(clusters_diana))
```

Some clusters are quite similar here, whereas others are very different.

Often, we are interested in finding a comparatively small number of clusters, $k$. In hierarchical clustering, we can reduce the number of clusters by "cutting" the dendrogram tree. To do so using the `factoextra` package, we first use `hcut`\index{\texttt{hcut}} to cut the tree into $k$ parts, and then `fviz_dend`\index{\texttt{fviz\_dend}} to plot the dendrogram, with each cluster plotted in a different colour. If, for instance, we want $k=5$ clusters^[Just to be clear, 5 is just an arbitrary number here. We could of course want 4, 14, or any other number of clusters.] and want to use `agnes` with average linkage and Euclidean distance for the clustering, we'd do the following:

```{r eval=FALSE}
library(factoextra)
votes.repub %>% scale() %>%
                hcut(k = 5, hc_func = "agnes",
                     hc_method = "average",
                     hc_metric = "euclidean") %>% 
                fviz_dend()
```

There is no inherent meaning to the colours - they are simply a way to visually distinguish between clusters.

Hierarchical clustering is especially suitable for data with named observations. For other types of data, other methods may be better. We will consider some alternatives next.

$$\sim$$

```{exercise, label="ch4exc25"}
Continue the last example above by changing the clustering method to complete linkage with the Manhattan distance.

1. Do any of the 5 coloured clusters remain the same?
  
2. How can you produce a tanglegram with 5 coloured clusters, to better compare the results from the two clusterings?

```
`r if (knitr::is_html_output()) '[(Click here to go to the solution.)]' else '[]'`(#ch4solutions25)


<br>

```{exercise, label="ch4exc25b"}
The `USArrests`\index{data!\texttt{USArrests}} data contains statistics on violent crime rates in 50 US states. Perform a hierarchical cluster analysis of the data. With which states are Maryland clustered?

```
`r if (knitr::is_html_output()) '[(Click here to go to the solution.)]' else '[]'`(#ch4solutions25b)


### Heatmaps and clustering variables
When looking at a dendrogram, you may ask why and how different observations are similar. Similarities between observations can be visualised using a _heatmap_\index{heatmap}, which displays the levels of different variables using colour hues or intensities. The `heatmap`\index{\texttt{heatmap}} function creates a heatmap from a `matrix` object. Let's try it with the `votes.repub` voting data. Because `votes.repub` is a `data.frame` object, we have to convert it to a matrix with `as.matrix` first (see Section \@ref(typesoftables)):

```{r eval=FALSE}
library(cluster)
library(magrittr)
votes.repub %>% as.matrix() %>% heatmap()
```

You may want to increase the height of your Plot window so that the names of all states are displayed properly. Using the default colours, low values are represented by a light yellow and high values by a dark red. White represents missing values.

You'll notice that dendrograms are plotted along the margins. `heatmap` performs hierarchical clustering (by default, agglomerative with complete linkage) of the observations as well as of the variables. In the latter case, variables are grouped together based on _similarities between observations_, creating _clusters of variables_\index{clustering!of variables}. In essence, this is just a hierarchical clustering of the transposed data matrix, but it does offer a different view of the data, which at times can be very revealing. The rows and columns are sorted according to the two hierarchical clusterings.

As per usual, it is a good idea to standardise the data before clustering, which can be done using the `scale` argument in `heatmap`. There are two options for scaling, either in the row direction (preferable if you wish to cluster variables) or the column direction (preferable if you wish to cluster observations):

```{r eval=FALSE}
# Standardisation suitable for clustering variables:
votes.repub %>% as.matrix() %>% heatmap(scale = "row")

# Standardisation suitable for clustering observations:
votes.repub %>% as.matrix() %>% heatmap(scale = "col")
```

Looking at the first of these plots, we can see which elections (i.e. which variables) had similar outcomes in terms of Republican votes. For instance, we can see that the elections in 1960, 1976, 1888, 1884, 1880, and 1876 all had similar outcomes, with the large number of orange rows indicating that the Republicans neither did great nor did poorly.

If you like, you can change the colour palette used. As in Section \@ref(colourpalettes), you can choose between palettes from http://www.colorbrewer2.org. `heatmap` is not a `ggplot2` function, so this is done in a slightly different way to what you're used to from other examples\index{\texttt{colorRampPalette}}\index{\texttt{brewer.pal}}. Here are two examples, with the white-blue-purple sequential palette `"BuPu"` and the red-white-blue diverging palette `"RdBu"`:

```{r eval=FALSE}
library(RColorBrewer)
col_palette <- colorRampPalette(brewer.pal(8, "BuPu"))(25)
votes.repub %>% as.matrix() %>%
    heatmap(scale = "row", col = col_palette)

col_palette <- colorRampPalette(brewer.pal(8, "RdBu"))(25)
votes.repub %>% as.matrix() %>%
    heatmap(scale = "row", col = col_palette)
```

$$\sim$$

```{exercise, label="ch4exc25c"}
Draw a heatmap for the `USArrests` data. Have a look at Maryland and the states with which it is clustered. Do they have high or low crime rates?

```
`r if (knitr::is_html_output()) '[(Click here to go to the solution.)]' else '[]'`(#ch4solutions25c)


### Centroid-based clustering {#centroid}
Let's return to the `seeds` data that we explored in Section \@ref(pca):

```{r eval=FALSE}
# Download the data:
seeds <- read.table("https://tinyurl.com/seedsdata",
        col.names = c("Area", "Perimeter", "Compactness",
         "Kernel_length", "Kernel_width", "Asymmetry",
         "Groove_length", "Variety"))
seeds$Variety <- factor(seeds$Variety)
```

We know that there are three varieties of seeds in this dataset, but what if we didn't? Or what if we'd lost the labels and didn't know what seeds are of what type? There are no rows names for this data, and plotting a dendrogram may therefore not be that useful. Instead, we can use $k$-means clustering\index{clustering!$k$-means}\index{clustering!centroid}, where the points are clustered into $k$ clusters based on their distances to the cluster means, or _centroids_.

When performing $k$-means clustering (using the algorithm of Hartigan & Wong (1979) that is the default in the function that we'll use), the data is split into $k$ clusters based on their distance to the mean of all points. Points are then moved between clusters, one at a time, based on how close they are (as measured by Euclidean distance) to the mean of each cluster. The algorithm finishes when no point can be moved between clusters without increasing the average distance between points and the means of their clusters.

To run a $k$-means clustering in R, we can use `kmeans`\index{\texttt{kmeans}}. Let's start by using $k=3$ clusters:

```{r eval=FALSE}
# First, we standardise the data, and then we do a k-means
# clustering.
# We ignore variable 8, Variety, which is the group label.
library(magrittr)
seeds[, -8] %>% scale() %>% 
                kmeans(centers = 3) -> seeds_cluster

seeds_cluster
```

To visualise the results, we'll plot the first two principal components. We'll use colour to show the clusters. Moreover, we'll plot the different varieties in different shapes, to see if the clusters found correspond to different varieties:

```{r eval=FALSE}
# Compute principal components:
pca <- prcomp(seeds[,-8])
library(ggfortify)
autoplot(pca, data = seeds, colour = seeds_cluster$cluster,
         shape = "Variety", size = 2, alpha = 0.75)
```

In this case, the clusters more or less overlap with the varieties! Of course, in a lot of cases, we don't know the number of clusters beforehand. What happens if we change $k$?

First, we try $k=2$:

```{r eval=FALSE}
seeds[, -8] %>% scale() %>% 
                kmeans(centers = 2) -> seeds_cluster
autoplot(pca, data = seeds, colour = seeds_cluster$cluster,
         shape = "Variety", size = 2, alpha = 0.75)
```

Next, $k=4$:

```{r eval=FALSE}
seeds[, -8] %>% scale() %>% 
                kmeans(centers = 4) -> seeds_cluster
autoplot(pca, data = seeds, colour = seeds_cluster$cluster,
         shape = "Variety", size = 2, alpha = 0.75)
```

And finally, a larger number of clusters, say $k=12$:

```{r eval=FALSE}
seeds[, -8] %>% scale() %>% 
                kmeans(centers = 12) -> seeds_cluster
autoplot(pca, data = seeds, colour = seeds_cluster$cluster,
         shape = "Variety", size = 2, alpha = 0.75)
```

If it weren't for the fact that the different varieties were shown as different shapes, we'd have no way to say, based on this plot alone, which choice of $k$ that is preferable here. Before we go into methods for choosing $k$ though, we'll mention `pam`\index{\texttt{pam}}. `pam` is an alternative to $k$-means that works in the same way, but uses median-like points, _medoids_ instead of cluster means. This makes it more robust to outliers. Let's try it with $k=3$ clusters:

```{r eval=FALSE}
seeds[, -8] %>% scale() %>% 
                pam(k = 3) -> seeds_cluster
autoplot(pca, data = seeds, colour = seeds_cluster$clustering,
         shape = "Variety", size = 2, alpha = 0.75)
```

For both `kmeans` and `pam`, there are visual tools that can help us choose the value of $k$ in the `factoextra` package\index{\texttt{factoextra}}. Let's install it:

```{r eval=FALSE}
install.packages("factoextra")
```

The `fviz_nbclust`\index{\texttt{fviz\_nbclust}} function in `factoextra` can be used to obtain plots that can guide the choice of $k$. It takes three arguments as input: the data, the clustering function (e.g. `kmeans` or `pam`) and the method used for evaluating different choices of $k$. There are three options for the latter: `"wss"`, `"silhouette"` and `"gap_stat"`.

`method = "wss"` yields a plot that relies on the within-cluster sum of squares, WSS, which is a measure of the within-cluster variation\index{clustering!WSS}. The smaller this is, the more compact are the clusters. The WSS is plotted for several choices of $k$, and we look for an "elbow", just as we did when using a scree plot for PCA. That is, we look for the value of $k$ such that increasing $k$ further doesn't improve the WSS much. Let's have a look at an example, using `pam` for clustering:

```{r eval=FALSE}
library(factoextra)
fviz_nbclust(scale(seeds[, -8]), pam, method = "wss")

# Or, using a pipeline instead:
library(magrittr)
seeds[, -8] %>% scale() %>% 
                fviz_nbclust(pam, method = "wss")
```

$k=3$ seems like a good choice here.

`method = "silhouette"` produces a silhouette plot\index{clustering!silhouette plot}. The silhouette value measures how similar a point is compared to other points in its cluster. The closer to 1 this value is, the better. In a silhouette plot, the average silhouette value for points in the data are plotted against $k$:

```{r eval=FALSE}
fviz_nbclust(scale(seeds[, -8]), pam, method = "silhouette")
```

Judging by this plot, $k=2$ appears to be the best choice.

Finally, `method = "gap_stat"` yields a plot of the gap statistic\index{clustering!gap statistic} (Tibshirani et al., 2001), which is based on comparing the WSS to its expected value under a null distribution obtained using the bootstrap (Section \@ref(bootstrap)). Higher values of the gap statistic are preferable:

```{r eval=FALSE}
fviz_nbclust(scale(seeds[, -8]), pam, method = "gap_stat")
```

In this case, $k=3$ gives the best value.

In addition to plots for choosing $k$, `factoextra`\index{\texttt{fviz\_cluster}} provides the function `fviz_cluster` for creating PCA-based plots, with an option to add convex hulls or ellipses around the clusters:

```{r eval=FALSE}
# First, find the clusters:
seeds[, -8] %>% scale() %>% 
                kmeans(centers = 3) -> seeds_cluster

# Plot clusters and their convex hulls:
library(factoextra)
fviz_cluster(seeds_cluster, data = seeds[, -8])

# Without row numbers:
fviz_cluster(seeds_cluster, data = seeds[, -8], geom = "point")

# With ellipses based on the multivariate normal distribution:
fviz_cluster(seeds_cluster, data = seeds[, -8],
             geom = "point", ellipse.type = "norm")
```

Note that in this plot, the shapes correspond to the clusters and not the varieties of seeds.

$$\sim$$

```{exercise, label="ch4exc26"}
The `chorSub`\index{data!\texttt{chorSub}} data from `cluster` contains measurements of 10 chemicals in 61 geological samples from the Kola Peninsula. Cluster this data using using either `kmeans` or `pam` (does either seem to be a better choice here?). What is a good choice of $k$ here? Visualise the results.

```
`r if (knitr::is_html_output()) '[(Click here to go to the solution.)]' else '[]'`(#ch4solutions26)


<!--### A hybrid approach
```{r eval=FALSE}
library(factoextra)
seeds_cluster <- hkmeans(seeds[, -8], 3)
autoplot(pca, data = seeds, colour = seeds_cluster$cluster,
         shape = "Variety", size = 2, alpha = 0.75)
```
-->

### Fuzzy clustering
An alternative to $k$-means clustering is _fuzzy clustering_\index{clustering!fuzzy}, in which each point is "spread out" over the $k$ clusters instead of being placed in a single cluster. The more similar it is to other observations in a cluster, the higher is its membership in that cluster. Points can have a high degree of membership to several clusters, which is useful in applications where points should be allowed to belong to more than one cluster. An important example is genetics, where genes can encode proteins with more than one function. If each point corresponds to a gene, it then makes sense to allow the points to belong to several clusters, potentially associated with different functions. The opposite of fuzzy clustering is _hard clustering_, in which each point only belongs to one cluster.

`fanny`\index{\texttt{fanny}} from `cluster` can be used to perform fuzzy clustering:

```{r eval=FALSE}
library(cluster)
library(magrittr)
seeds[, -8] %>% scale() %>% 
                fanny(k = 3) -> seeds_cluster

# Check membership of each cluster for the different points:
seeds_cluster$membership

# Plot the closest hard clustering:
library(factoextra)
fviz_cluster(seeds_cluster, geom = "point")
```

As for `kmeans` and `pam`, we can use `fviz_nbclust` to determine how many clusters to use:

```{r eval=FALSE}
seeds[, -8] %>% scale() %>% 
                fviz_nbclust(fanny, method = "wss")
seeds[, -8] %>% scale() %>% 
                fviz_nbclust(fanny, method = "silhouette")
# Producing the gap statistic plot takes a while here, so
# you may want to skip it in this case:
seeds[, -8] %>% scale() %>% 
                fviz_nbclust(fanny, method = "gap")
```

$$\sim$$

```{exercise, label="ch4exc27"}
Do a fuzzy clustering of the `USArrests` data. Is Maryland strongly associated with a single cluster, or with several clusters? What about New Jersey?

```
`r if (knitr::is_html_output()) '[(Click here to go to the solution.)]' else '[]'`(#ch4solutions27)


### Model-based clustering {#modelbasedclustering}
As a last option, we'll consider model-based clustering\index{clustering!model-based}, in which each cluster is assumed to come from a multivariate normal distribution. This will yield ellipsoidal clusters. `Mclust` from the `mclust` package\index{\texttt{Mclust}}\index{\texttt{mclust}} fits such a model, called a Gaussian finite mixture model, using the EM-algorithm (Scrucca et al., 2016). First, let's install the package:

```{r eval=FALSE}
install.packages("mclust")
```

Now, let's cluster the `seeds` data. The number of clusters is chosen as part of the clustering procedure. We'll use a function from the `factoextra`\index{\texttt{factoextra}} for plotting the clusters with ellipsoids, and so start by installing that:

```{r eval=FALSE}
install.packages("factoextra")

library(mclust)
seeds_cluster <- Mclust(scale(seeds[, -8]))
summary(seeds_cluster)

# Plot results with ellipsoids:
library(factoextra)
fviz_cluster(seeds_cluster, geom = "point", ellipse.type = "norm")
```

Gaussian finite mixture models are based on the assumption that the data is numerical. For categorical data, we can use latent class analysis, which we'll discuss in Section \@ref(lca), instead.

$$\sim$$

```{exercise, label="ch4exc28"}
Return to the `chorSub`\index{data!\texttt{chorSub}} data from Exercise \@ref(exr:ch4exc26). Cluster it using a Gaussian finite mixture model. How many clusters do you find?

```
`r if (knitr::is_html_output()) '[(Click here to go to the solution.)]' else '[]'`(#ch4solutions28)


### Comparing clusters
Having found some interesting clusters, we are often interested in exploring differences between the clusters. To do so, we much first extract the cluster labels from our clustering (which are contained in the variables `clustering` for methods with Western female names, `cluster` for `kmeans`, and `classification` for `Mclust`). We can then add those labels to our data frame and use them when plotting.

For instance, using the `seeds` data, we can compare the area of seeds from different clusters:

```{r eval=FALSE}
# Cluster the seeds using k-means with k=3:
library(cluster)
seeds[, -8] %>% scale() %>% 
                kmeans(centers = 3) -> seeds_cluster

# Add the results to the data frame:
seeds$clusters <- factor(seeds_cluster$cluster)
# Instead of $cluster, we'd use $clustering for agnes, pam, and fanny
# objects, and $classification for an Mclust object.

# Compare the areas of the 3 clusters using boxplots:
library(ggplot2)
ggplot(seeds, aes(x = Area, group = clusters, fill = clusters)) +
      geom_boxplot()

# Or using density estimates:
ggplot(seeds, aes(x = Area, group = clusters, fill = clusters)) +
      geom_density(alpha = 0.7)
```

We can also create a scatterplot matrix to look at all variables simultaneously:

```{r eval=FALSE}
library(GGally)
ggpairs(seeds[, -8], aes(colour = seeds$clusters, alpha = 0.2))
```

It may be tempting to run some statistical tests (e.g. a t-test) to see if there are differences between the clusters. Note, however, that in statistical hypothesis testing, it is typically assumed that the hypotheses that are being tested have been generated independently from the data. _Double-dipping_, where the data first is used to generate a hypothesis ("judging from this boxplot, there seems to be a difference in means between these two groups!" or "I found these clusters, and now I'll run a test to see if they are different") and then test that hypothesis, is generally frowned upon, as that substantially inflates the risk of a type I error. Recently, there have however been some advances in valid techniques for testing differences in means between clusters found using hierarchical clustering; see Gao et al. (2020).

## Exploratory factor analysis
The purpose of _factor analysis_\index{factor analysis} is to describe and understand the correlation structure for a set of observable variables through a smaller number of unobservable underlying variables, called _factors_ or _latent variables_. These are thought to explain the values of the observed variables in a causal manner. Factor analysis is a popular tool in psychometrics, where it for instance is used to identify latent variables that explain people's results on different tests, e.g. related to personality, intelligence, or attitude.

### Factor analysis
We'll use the `psych`\index{\texttt{psych}} package, along with the associated package `GPArotation`\index{\texttt{GPArotation}}, for our analyses. Let's install them:

```{r eval=FALSE}
install.packages(c("psych", "GPArotation"))
```

For our first example of factor analysis, we'll be using the `attitude`\index{data!\texttt{attitude}} data that comes with R. It describes the outcome of a survey of employees at a financial organisation. Have a look at its documentation to read about the variables in the dataset:

```{r eval=FALSE}
?attitude
attitude
```

To fit a factor analysis model to these data, we can use `fa`\index{\texttt{fa}} from `psych`. `fa` requires us to specify the number of factors used in the model. We'll get back to how to choose the number of factors, but for now, let's go with 2:

```{r eval=FALSE}
library(psych)
# Fit factor model:
attitude_fa <- fa(attitude, nfactors = 2,
                  rotate = "oblimin", fm = "ml")
```

`fa` does two things for us. First, it fits a factor model to the data, which yields a table of _factor loadings_, i.e. the correlation between the two unobserved factors and the observed variables. However, there is an infinite number of mathematically valid factor models for any given dataset. Therefore, the factors are _rotated_ according to some rule to obtain a factor model that hopefully allows for easy and useful interpretation. Several methods can be used to fit the factor model (set using the `fm` argument in `fa`) and for rotation the solution (set using `rotate`). We'll look at some of the options shortly.

First, we'll print the result, showing the factor loadings (after rotation). We'll also plot the resulting model using `fa.diagram`\index{\texttt{fa.diagram}}, showing the correlation between the factors and the observed variables:

```{r eval=FALSE}
# Print results:
attitude_fa

# Plot results:
fa.diagram(attitude_fa, simple = FALSE)
```

The first factor is correlated to the variables `advance`, `learning` and `raises`. We can perhaps interpret this factor as measuring the employees' career opportunity at the organisation. The second factor is strongly correlated to `complaints` and (overall) `rating`, but also to a lesser degree correlated to `raises`, `learning` and `privileges`. This can maybe be interpreted as measuring how the employees' feel that they are treated at the organisation.

We can also see that the two factors are correlated. In some cases, it makes sense to expect the factors to be uncorrelated. In that case, we can change the rotation method used, from `oblimin` (which yields _oblique rotations_, allowing for correlations - usually a good default) to `varimax`, which yields uncorrelated factors:

```{r eval=FALSE}
attitude_fa <- fa(attitude, nfactors = 2,
                  rotate = "varimax", fm = "ml")
fa.diagram(attitude_fa, simple = FALSE)
```

In this case, the results are fairly similar.

The `fm = "ml"` setting means that maximum likelihood estimation of the factor model is performed, under the assumption of a normal distribution for the data. Maximum likelihood estimation is widely recommended for estimation of factor models, and can often work well even for non-normal data (Costello & Osborne, 2005). However, there are cases where it fails to find useful factors. `fa` offers several different estimation methods. A good alternative is `minres`, which often works well when maximum likelihood fails:

```{r eval=FALSE}
attitude_fa <- fa(attitude, nfactors = 2,
                  rotate = "oblimin", fm = "minres")
fa.diagram(attitude_fa, simple = FALSE)
```

Once again, the results are similar to what we saw before. In other examples, the results differ more. When choosing which estimation method and rotation to use, bear in mind that in an exploratory study, there is no harm in playing around with a few different methods. After all, your purpose is to generate hypotheses rather than confirm them, and looking at the data in a few different ways will help you do that.

To determine the number of factors that are appropriate for a particular dataset, we can draw a scree plot with `scree`\index{\texttt{scree}}. This is interpreted in the same way as for principal components analysis (Section \@ref(pca)) and centroid-based clustering (Section \@ref(centroid)) - we look for an "elbow" in the plot, which tells us at which point adding more factors no longer contributes much to the model:

```{r eval=FALSE}
scree(attitude, pc = FALSE)
```

A useful alternative version of this is provided by `fa.parallel`\index{\texttt{fa.parallel}}, which adds lines showing what the scree plot would look like for randomly generated uncorrelated data of the same size as the original dataset. As long as the blue line, representing the actual data, is higher than the red line, representing randomly generated data, adding more factors improves the model:

```{r eval=FALSE}
fa.parallel(attitude, fm = "ml", fa = "fa")
```

Some older texts recommend that only factors with an eigenvalue (the y-axis in the scree plot) greater than 1 be kept in the model. It is widely agreed that this so-called Kaiser rule is inappropriate (Costello & Osborne, 2005), as it runs the risk of leaving out important factors.

Similarly, some older texts also recommend using principal components analysis to fit factor models. While the two are mathematically similar in that both in some sense reduce the dimensionality of the data, PCA and factor analysis are designed to target different problems. Factor analysis is concerned with an underlying causal structure where the unobserved factors affect the observed variables. In contrast, PCA simply seeks to create a small number of variables that summarise the variation in the data, which can work well even if there are no unobserved factors affecting the variables.

$$\sim$$

```{exercise, label="ch4exc29"}
Factor analysis only relies on the covariance or correlation matrix of your data. When using `fa` and other functions for factor analysis, you can input either a data frame or a covariance/correlation matrix. Read about the `ability.cov` data that comes shipped with R, and perform a factor analysis of it.

```
`r if (knitr::is_html_output()) '[(Click here to go to the solution.)]' else '[]'`(#ch4solutions29)


### Latent class analysis {#lca}
When there is a single categorical latent variable, factor analysis overlaps with clustering, which we studied in Section \@ref(clustering). Whether we think of the values of the latent variable as clusters, classes, factor levels, or something else is mainly a philosophical question - from a mathematical perspective, it doesn't matter what name we use for them.

When observations from the same cluster are assumed to be uncorrelated, the resulting model is called _latent profile analysis_\index{latent profile analysis}, which typically is handled using model-based clustering (Section \@ref(modelbasedclustering)). The special case where the observed variables are categorical is instead known as _latent class analysis_\index{latent class analysis}. This is common e.g. in analyses of survey data, and we'll have a look at such an example in this section. The package that we'll use for our analyses is called `poLCA`\index{\texttt{poLCA}} - let's install it:

```{r eval=FALSE}
install.packages("poLCA")
```

The [National Mental Health Services Survey](https://www.datafiles.samhsa.gov/study-dataset/national-mental-health-services-survey-2019-n-mhss-2019-ds0001-nid18959)\index{data!National Mental Health Services Survey} is an annual survey collecting information about mental health treatment facilities in the US. We'll analyse data from the 2019 survey, courtesy of the Substance Abuse and Mental Health Data Archive, and try to find latent classes. [Download `nmhss-puf-2019.csv` from the book's web page](http://www.modernstatisticswithr.com/data.zip), and set `file_path` to its path. We can then load and look at a summary of the data using:

```{r eval=FALSE}
nmhss <- read.csv(file_path)
summary(nmhss)
```

All variables are categorical (except perhaps for the first one, which is an identifier). According to the [survey's documentation](https://www.datafiles.samhsa.gov/sites/default/files/field-uploads-protected/studies/N-MHSS-2019/N-MHSS-2019-datasets/N-MHSS-2019-DS0001/N-MHSS-2019-DS0001-info/N-MHSS-2019-DS0001-info-codebook.pdf), negative values are used to represent missing values. For binary variables, `0` means no/non-presence and `1` means yes/presence.

Next, we'll load the `poLCA` package and read the documentation for the function that we'll use for the analysis.

```{r eval=FALSE}
library(poLCA)
?poLCA
```

As you can see in the description of the `data` argument, the observed variables (called _manifest variables_ here) are only allowed to contain consecutive integer values, starting from 1. Moreover, missing values should be represented by `NA`, and not by negative numbers (just as elsewhere in R!). We therefore need to make two changes to our data:

* Change negative values to `NA`,
* Change the levels of binary variables so that `1` means no/non-presence and `2` means yes/presence.

In our example, we'll look at variables describing what treatments are available at the different facilities. Let's create a new data frame for those variables:

```{r eval=FALSE}
treatments <- nmhss[, names(nmhss)[17:30]]
summary(treatments)
```

To make the changes to the data that we need, we can do the following:

```{r eval=FALSE}
# Change negative values to NA:
treatments[treatments < 0] <- NA

# Change binary variables from 0 and 1 to
# 1 and 2:
treatments <- treatments + 1

# Check the results:
summary(treatments)
```

We are now ready to get started with our analysis. To begin with, we will try to find classes based on whether or not the facilities offer the following five treatments:

* `TREATPSYCHOTHRPY`: The facility offers individual psychotherapy,
* `TREATFAMTHRPY`: The facility offers couples/family therapy,
* `TREATGRPTHRPY`: The facility offers group therapy,
* `TREATCOGTHRPY`: The facility offers cognitive behavioural therapy,
* `TREATPSYCHOMED`: The facility offers psychotropic medication.
The `poLCA` function needs three inputs: a formula describing what observed variables to use, a data frame containing the observations, and `nclass`, the number of latent classes to find. To begin with, let's try two classes:

```{r eval=FALSE}
m <- poLCA(cbind(TREATPSYCHOTHRPY, TREATFAMTHRPY,
                 TREATGRPTHRPY, TREATCOGTHRPY,
                 TREATPSYCHOMED) ~ 1,
           data = treatments, nclass = 2)
```

The output shows the probabilities of 1's (no/non-presence) and 2's (yes/presence) for the two classes. So, for instance, from the output

```{r eval=FALSE}
$TREATPSYCHOTHRPY
           Pr(1)  Pr(2)
class 1:  0.6628 0.3372
class 2:  0.0073 0.9927
```

we gather that 34 % of facilities belonging to the first class offer individual psychotherapy, whereas 99 % of facilities from the second class offer individual psychotherapy. Looking at the other variables, we see that the second class always has high probabilities of offering therapies, while the first class doesn't. Interpreting this, we'd say that the second class contains facilities that offer a wide variety of treatments, and the first facilities that only offer some therapies. Finally, we see from the output that 88 % of the facilities belong to the second class:

```{r eval=FALSE}
Estimated class population shares 
 0.1167 0.8833 
```

We can visualise the class differences in a plot:

```{r eval=FALSE}
plot(m)
```

To see which classes different observations belong to, we can use:

```{r eval=FALSE}
m$predclass
```

Just as in a cluster analysis, it is often a good idea to run the analysis with different numbers of classes. Next, let's try 3 classes:

```{r eval=FALSE}
m <- poLCA(cbind(TREATPSYCHOTHRPY, TREATFAMTHRPY,
                 TREATGRPTHRPY, TREATCOGTHRPY,
                 TREATPSYCHOMED) ~ 1,
           data = treatments, nclass = 3)
```

This time, we run into numerical problems - the model estimation has failed, as indicated by the following warning message:

```{r eval=FALSE}
ALERT: iterations finished, MAXIMUM LIKELIHOOD NOT FOUND 
```

`poLCA` fits the model using a method known as the _EM algorithm_, which finds maximum likelihood estimates numerically. First, the observations are randomly assigned to the classes. Step by step, the observations are then moved between classes, until the optimal split has been found. It can however happen that more steps are needed to find the optimum (by default 1,000 steps are used), or that we end up with unfortunate initial class assignments that prevent the algorithm from finding the optimum. To attenuate this problem, we can increase the number of steps used, or run the algorithm multiple times, each with new initial class assignments. The `poLCA` arguments for this are `maxiter`, which controls the number of steps (or iterations) used, and `nrep`, which controls the number of repetitions with different initial assignments. We'll increase both, and see if that helps. Note that this means that the algorithm will take longer to run:

```{r eval=FALSE}
m <- poLCA(cbind(TREATPSYCHOTHRPY, TREATFAMTHRPY,
                 TREATGRPTHRPY, TREATCOGTHRPY,
                 TREATPSYCHOMED) ~ 1,
           data = treatments, nclass = 3,
           maxiter = 2500, nrep = 5)
```

These setting should do the trick for this dataset, and you probably won't see a warning message this time. If you do, try increasing either number and run the code again.

The output that you get can differ between runs - in particular, the order of the classes can differ depending on initial assignments. Here is part of the output from my run:

```{r eval=FALSE}
$TREATPSYCHOTHRPY
           Pr(1)  Pr(2)
class 1:  0.0076 0.9924
class 2:  0.0068 0.9932
class 3:  0.6450 0.3550

$TREATFAMTHRPY
           Pr(1)  Pr(2)
class 1:  0.1990 0.8010
class 2:  0.0223 0.9777
class 3:  0.9435 0.0565

$TREATGRPTHRPY
           Pr(1)  Pr(2)
class 1:  0.0712 0.9288
class 2:  0.3753 0.6247
class 3:  0.4935 0.5065

$TREATCOGTHRPY
           Pr(1)  Pr(2)
class 1:  0.0291 0.9709
class 2:  0.0515 0.9485
class 3:  0.5885 0.4115

$TREATPSYCHOMED
           Pr(1)  Pr(2)
class 1:  0.0825 0.9175
class 2:  1.0000 0.0000
class 3:  0.3406 0.6594

Estimated class population shares 
 0.8059 0.0746 0.1196 
```

We can interpret this as follows:

* Class 1 (81 % of facilities): Offer all treatments, including psychotropic medication.
* Class 2 (7 % of facilities): Offer all treatments, except for psychotropic medication.
* Class 3 (12 % of facilities): Only offer some treatments, which may include psychotropic medication.

You can either let interpretability guide your choice of how many classes to include in your analysis, our use model fit measures like $AIC$ and $BIC$, which are printed in the output and can be obtained from the model using:

```{r eval=FALSE}
m$aic
m$bic
```

The lower these are, the better is the model fit.

If you like, you can add a covariate to your latent class analysis, which allows you to simultaneously find classes and study their relationship with the covariate. Let's add the variable `PAYASST` (which says whether a facility offers treatment at no charge or minimal payment to clients who cannot afford to pay) to our data, and then use that as a covariate.

```{r eval=FALSE}
# Add PAYASST variable to data, then change negative values
# to NA's:
treatments$PAYASST <- nmhss$PAYASST
treatments$PAYASST[treatments$PAYASST < 0] <- NA

# Run LCA with covariate:
m <- poLCA(cbind(TREATPSYCHOTHRPY, TREATFAMTHRPY,
                 TREATGRPTHRPY, TREATCOGTHRPY,
                 TREATPSYCHOMED) ~ PAYASST,
           data = treatments, nclass = 3,
           maxiter = 2500, nrep = 5)
```

My output from this model includes the following tables:

```{r eval=FALSE}
========================================================= 
Fit for 3 latent classes: 
========================================================= 
2 / 1 
            Coefficient  Std. error  t value  Pr(>|t|)
(Intercept)     0.10616     0.18197    0.583     0.570
PAYASST         0.43302     0.11864    3.650     0.003
========================================================= 
3 / 1 
            Coefficient  Std. error  t value  Pr(>|t|)
(Intercept)     1.88482     0.20605    9.147         0
PAYASST         0.59124     0.10925    5.412         0
=========================================================
```

The interpretation is that both class 2 and class 3 differ significantly from class 1 (the p-values in the `Pr(>|t|)` column are low), with the positive coefficients for `PAYASST` telling us that class 2 and 3 facilities are more likely to offer pay assistance than class 1 facilities.

$$\sim$$

```{exercise, label="ch4exc30"}
The `cheating`\index{data!\texttt{cheating}} dataset from `poLCA` contains students' answers to four questions about cheating, along with their grade point averages (GPA). Perform a latent class analysis using GPA as a covariate. What classes do you find? Does having a high GPA increase the probability of belonging to either class?

```
`r if (knitr::is_html_output()) '[(Click here to go to the solution.)]' else '[]'`(#ch4solutions30)


\vfill