r-viz-homework.Rmd

---
title: "Advanced Data Visualization Homework"
output:
  html_document:
    code_folding: hide
---

(_Refer back to the [Advanced Data Visualization lesson](r-viz-gapminder.html))._

```{r inithomework, include=FALSE}
library(knitr)
opts_chunk$set(eval=TRUE, echo=TRUE, fig.keep="last", message = FALSE, warning = FALSE, cache=TRUE)
```

### Key Concepts

> 
- geoms
- aesthetic mappings
- statistical layers
- scales
- ggthemes
- ggsave

### Getting Started

The data we're going to look at is cleaned up version of a gene expression dataset from [Brauer et al. Coordination of Growth Rate, Cell Cycle, Stress Response, and Metabolic Activity in Yeast (2008) _Mol Biol Cell_ 19:352-367](http://www.ncbi.nlm.nih.gov/pubmed/17959824). This data is from a gene expression microarray, and in this paper the authors are examining the relationship between growth rate and gene expression in yeast cultures limited by one of six different nutrients (glucose, leucine, ammonium, sulfate, phosphate, uracil). If you give yeast a rich media loaded with nutrients except restrict the supply of a _single_ nutrient, you can control the growth rate to any rate you choose. By starving yeast of specific nutrients you can find genes that: 

1. **Raise or lower their expression in response to growth rate**. Growth-rate dependent expression patterns can tell us a lot about cell cycle control, and how the cell responds to stress. The authors found that expression of >25% of all yeast genes is linearly correlated with growth rate, independent of the limiting nutrient. They also found that the subset of negatively growth-correlated genes is enriched for peroxisomal functions, and positively correlated genes mainly encode ribosomal functions. 
2. **Respond differently when different nutrients are being limited**. If you see particular genes that respond very differently when a nutrient is sharply restricted, these genes might be involved in the transport or metabolism of that specific nutrient.

You can download the cleaned up version of the data at [the link above](data.html). The file is called [**brauer2007_tidy.csv**](data/brauer2007_tidy.csv). Load the **ggplot2**, **dplyr**, **readr** packages, and read the tidy _Brauer_ data into R using the `read_csv()` function (_note, **not** `read.csv()`_). Assign the data to an object called `ydat`.

_Note, the code is available by hitting the "Code" button above each expected output, but try not to use it unless you're stuck_.


```{r loadhwdata}
# Load required libraries
library(readr)
library(dplyr)
library(ggplot2)

# Read data from file
ydat <- read_csv("data/brauer2007_tidy.csv")

# Display the data
ydat
```


### Problem Set

Follow the prompts and use **ggplot2** to reproduce the plots below. 

#### Part 1

We can start by taking a look at the distribution of the expression values. 

1) Plot a histogram of the expression variable, and set the bin number equal to 100.

```{r}
ggplot(ydat, aes(x = expression)) +
    geom_histogram(bins = 100) 
```


2) Check the distribution of each nutrient in the data set by adjusting the fill aesthetic. Use the same bin number for this histogram.

```{r}
ggplot(ydat, aes(x = expression, fill = nutrient)) +
    geom_histogram(bins = 100)
```

Wow. That's ugly. Might be a candidate for [accidental aRt](http://accidental-art.tumblr.com/) but not very helpful for our analysis.

3) Now split off the same histogram into a faceted display with 3 columns.

```{r}
ggplot(ydat, aes(x = expression, fill = nutrient)) +
    geom_histogram(bins = 100) +
    facet_wrap(~ nutrient, ncol = 3)
```


The basic exploratory process above confirms that the overall distribution (as well each distribution by nutrient) is normal.

#### Part 2

Let's compare the genes with the highest and lowest average expression values. 

We can figure out which these are using some familiar logic:

1. Take the original *ydat* data frame ... 
2. Then *group by* symbol ...
3. Then *summarize* mean of all expression values for that symbol ...
4. Then *arrange* descending by the mean ...
5. Then *filter* for the first or last row.

The code below implements that pipeline in **dplyr** syntax:

```{r, echo=TRUE}
ydat %>%
    group_by(symbol) %>%
    summarise(meanexp = mean(expression)) %>%
    arrange(desc(meanexp)) %>%
    filter(row_number() == 1 | row_number() == n())
```

The output tells us that the gene with the highest mean expression is *HXT3*, while the gene with the lowest mean expression is *HXT6*.

4) Subset the data to only include these genes, and create a stripplot that has expression values as "jittered" points on the y-axis and the gene symbols the x-axis. 

> **HINT** you can add a "jitter" position to `geom_point()` but it's easier to control width of the effect if you use `geom_jitter()`

```{r}
gene_exp <- 
    ydat %>%
    filter(symbol == "HXT3" | symbol == "HXT6")

ggplot(gene_exp, aes(x = symbol, y = expression)) +
    geom_jitter(width = 0.1)
```


5) Now map each observation to its nutrient by color and adjust the size of the points to be 2.

```{r}
ggplot(gene_exp, aes(x = symbol, y = expression)) +
    geom_jitter(aes(col = nutrient), width = 0.1, size = 2)
```

Although these two genes are on opposite ends of the distribution of average expression values, they both seem to express similar amounts when Glucose is the restricted nutrient. 

#### Part 3

Now let's try to make something that has a little bit more of a polished look. 

6) Using **dplyr** logic, create a data frame that has the mean expression values for all combinations of rate and nutrient (_hint_: use `group_by()` and `summarize()`). Create a plot of this data with rate on the x-axis and mean expression on the y-axis and lines colored by nutrient. 

```{r}
nutrient_rates <-
    ydat %>%
    group_by(rate, nutrient) %>%
    summarise(meanexp = mean(expression)) 

ggplot(nutrient_rates, aes(x = rate, y =  meanexp)) + geom_line(aes(col=nutrient))

```


7) Add black dotted line (lty=3) that represents the smoothed mean of expression across all combinations of nutrients and rates. 

```{r}
ggplot(nutrient_rates, aes(x = rate, y =  meanexp)) + 
  geom_line(aes(col=nutrient), lty=1) + 
  geom_smooth(col = "black", lty=3, se = FALSE) 
```

8) Change the scale to include breaks for *all* of the rates.

> **HINT** The `read_csv()` function read in the rate variable as continuous rather than discrete. There are a few ways to remedy this, but first see if you can set the scale for the x axis variable without changing the dataframe.

```{r}
ggplot(nutrient_rates, aes(x = rate, y =  meanexp)) + 
  geom_line(aes(col=nutrient), lty=1) + 
  geom_smooth(col = "black", se = FALSE, lty=3) + 
  scale_x_continuous(breaks = nutrient_rates$rate)
```


9) By default `ggplot()` will name the x and y axes with names of their respective variables. You might want to apply more meaningful labels. Change the name of the x-axis to "Rate", the name of the y-axis to "Mean Expression" and the plot title to "Mean Expression By Rate (Brauer)"

> **HINT** `?labs` will pull up the **ggplot2** documentation on axes labels and plot titles.

```{r}
ggplot(nutrient_rates, aes(x = rate, y =  meanexp)) + 
  geom_line(aes(col=nutrient), lty=1) + 
  geom_smooth(col = "black", se = FALSE, lty=3) + 
  scale_x_continuous(breaks = nutrient_rates$rate) + 
  xlab("Rate") + ylab("Mean Expression") + 
  ggtitle("Mean Expression By Rate (Brauer)")
```


10) Add a theme from the **ggthemes** package. The plot below is based on Edward Tufte's book _The Visual Display of Quantitative Information_. Choose a theme that you like, but choose wisely -- some of these themes will override other adjustments you've made to your plot above, including axis labels.

> **HINT 1**: `library(ggthemes)` not working for you? [Install the package first](https://github.com/jrnold/ggthemes#install).

> **HINT 2** You can either do this by trial-and-error or check out the package vignette to get an idea of what each theme looks like: <https://github.com/jrnold/ggthemes>

```{r}
library(ggthemes)
ggplot(nutrient_rates, aes(x = rate, y =  meanexp)) + 
  geom_line(aes(col=nutrient), lty=1) + 
  geom_smooth(col = "black", se = FALSE, lty=3) + 
  scale_x_continuous(breaks = nutrient_rates$rate) + 
  xlab("Rate") + ylab("Mean Expression") + 
  ggtitle("Mean Expression By Rate (Brauer)") + 
  theme_tufte()
```


11) The last step is to save the plot you've created. Write your plot to a 10 X 6 PDF using a **ggplot2** function.

```{r, eval = FALSE}
p <- ggplot(nutrient_rates, aes(x = rate, y =  meanexp)) + 
  geom_line(aes(col=nutrient), lty=1) + 
  geom_smooth(col = "black", se = FALSE, lty=3) + 
  scale_x_continuous(breaks = nutrient_rates$rate) + 
  ylab("Mean Expression") + 
  ggtitle("Mean Expression By Rate (Brauer)") + 
  theme_tufte()

ggsave(p, filename = "plot.pdf", width = 10, height = 6)
```