index.Rmd

---
title: "Plotting the Course Through Charted Waters"
output:
  learnr::tutorial:
      theme: "cosmo"
tutorial:
  id: "org.wikimedia.mikhail.dataviz-literacy"
  version: 0.9.6
runtime: shiny_prerendered
---

```{r setup, include=FALSE}
library(learnr)
knitr::opts_chunk$set(echo = FALSE, warning = FALSE, message = FALSE)
```
```{r data}
library(magrittr)
library(dplyr)
library(ggplot2)
titanic <- data.frame(Titanic)
compress <- function(x, round_by = 2) {
  div <- findInterval(x, c(1, 1e3, 1e6, 1e9, 1e12))
  return(paste0(round( x / 10 ^ (3 * ifelse(div - 1 < 0, 0, div - 1)), round_by),
                c("", "", "K", "M", "B", "T")[div + 1]))
}
```

## Introduction

Heat maps, stacked area plots, mosaic plots, choropleths -- oh my! There are so many different ways to visually convey relationships and patterns in data! In this workshop on data visualization literacy, you'll learn to recognize many popular types of charts and how to glean insights from them. The **Appendix** contains some examples of data visualization as visual essays and it also includes links to resources for learning how to create your own.

This workshop is [available as open source](https://github.com/bearloga/wmf-allhands18). There is an [interactive version](http://dataviz-literacy.wmflabs.org/) (which should automatically send you to either [mirror 1](http://dataviz-lit-01.wmflabs.org/), [mirror 2](http://dataviz-lit-02.wmflabs.org/), or [mirror 3](http://dataviz-lit-03.wmflabs.org/)) and a [static version](https://bearloga.github.io/wmf-allhands18/).

|              | Contact Information                       |
|-------------:|:------------------------------------------|
| **Work**     | mikhail at wikimedia dot org              |
| **Personal** | mikhail at mpopov dot com                 |
| **IRC**      | bearloga in #wikimedia-discovery, etc.    |
| **Twitter**  | [bearloga](https://twitter.com/bearloga)  |

<!-- Piwik -->
<script type="text/javascript">
  var _paq = _paq || [];
  _paq.push(["setDomains", ["*.dataviz-literacy.wmflabs.org"]]);
  _paq.push(['trackPageView']);
  _paq.push(['enableLinkTracking']);
  (function() {
    var u="//piwik.wikimedia.org/";
    _paq.push(['setTrackerUrl', u+'piwik.php']);
    _paq.push(['setSiteId', '15']);
    var d=document, g=d.createElement('script'), s=d.getElementsByTagName('script')[0];
    g.type='text/javascript'; g.async=true; g.defer=true; g.src=u+'piwik.js'; s.parentNode.insertBefore(g,s);
  })();
</script>
<noscript><p><img src="//piwik.wikimedia.org/piwik.php?idsite=15" style="border:0;" alt="" /></p></noscript>
<!-- End Piwik Code -->

## Terms and basics

### Data visualization as storytelling

Graphical displays should:

- Show the data
- Induce the viewer to think about the substance rather than graphic design or format
- Avoid distorting the data
- Present many numbers in a small space
- Make large data sets coherent
- Encourage the eye to compare different pieces of data

-- Edward R. Tufte, *The Visual Display of Quantitative Information*

### Types of variables

- [*Quantitative* variables](https://en.wikipedia.org/wiki/Continuous_or_discrete_variable) have a numeric value and/or an ordering
  - *Continuous* variables have an infinite range of possible values
    - **Examples**: time, age, weight, lengths (height, distance, time spent online), drug dosage
  - Quantitative variables that have limited possible values are *discrete*
    - **Examples**: population size, number of times an event occurred, pageviews, number of questions a student got correct on a test
  - Continuous variables are sometimes discretized by rounding if precision is not necessary
- [*Categorical*](https://en.wikipedia.org/wiki/Categorical_variable) / *discrete* / *qualitative* variables have a limited number of possible values:
  - *Nominal* variables have two or more categories that do not have an intrinsic order
    - **Examples**: gender, ethnicity, control vs. test group, operating system
  - *Ordinal* variables are like nominal, but the categories have an ordering/ranking such as the [Likert rating scale](https://en.wikipedia.org/wiki/Likert_scale)
  - Categorical variables can also be created from quantitative variables
    - **Example**: survey takers are often combined into age groups such as "18-24"

Refer to [levels of measurement](https://en.wikipedia.org/wiki/Level_of_measurement) for more information.
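
In R, which this workshop uses for its charts, these variable types map roughly onto numeric vectors, factors, and ordered factors. The following is a minimal sketch with made-up values; the `ages`, `os`, and `rating` vectors and the age brackets are hypothetical and only illustrate the idea:

```r
# Hypothetical values, for illustration only
ages <- c(19, 23, 31, 47, 68)  # quantitative, rounded to whole years
os <- factor(c("Linux", "macOS", "Windows", "Linux", "macOS"))  # nominal
rating <- factor(
  c("Agree", "Neutral", "Disagree", "Agree", "Agree"),
  levels = c("Disagree", "Neutral", "Agree"),
  ordered = TRUE
)  # ordinal (Likert-style)

# Creating a categorical variable from a quantitative one
age_group <- cut(
  ages,
  breaks = c(18, 25, 35, 45, 55, Inf),
  labels = c("18-24", "25-34", "35-44", "45-54", "55+"),
  right = FALSE  # intervals are [18, 25), [25, 35), ...
)
table(age_group)

# Comparisons like this are only meaningful for ordered factors
rating[1] > rating[2]
```

Note that once ages are grouped, the exact values are lost, which is why this kind of conversion only goes one way.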

### Things to look for

- Title (most plots should have this)
- Axis labels (almost all plots should have this)
- How many variables and their types
    - Including ones used to dictate colors, shapes, patterns, sizes, opacities, etc.
    - Independent ("*predictor*") variables (e.g. time) are usually on the X (horizontal) axis
        - Occasionally time is plotted on the vertical axis for specific reasons
    - Dependent ("*outcome*" / "*response*") variables are usually on the Y (vertical) axis
- Scales (especially log-transformed ones)

## Common visualizations

### Pies, Waffles, Bars, and Tables

A [pie chart](https://en.wikipedia.org/wiki/Pie_chart) and a [bar chart](https://en.wikipedia.org/wiki/Bar_chart) (sometimes called a *bar plot*) are easy ways to visually compare values. The pie chart -- where the slices represent proportions of the whole -- is excellent for 2-4 categories, the table is great for 1-8 categories, and the bar chart works well for comparing more than 5 categories by the heights of its bars.

```{r pie_chart}
per_class <- titanic %>%
  group_by(Class) %>%
  summarize(Total = sum(Freq)) %>%
  mutate(
    Class = factor(Class, c("Crew", "3rd", "2nd", "1st")),
    Prop = Total / sum(Total),
    Position = cumsum(Prop) - (Prop / 2)
  )
ggplot(per_class, aes(x = factor(""), y = Prop, fill = Class)) +
  geom_bar(width = 1, stat = "identity", color = "black") +
  scale_fill_brewer(palette = "Set1") +
  coord_polar("y") +
  theme_minimal(14) +
  theme(
    axis.title.x = element_blank(),
    axis.title.y = element_blank(),
    panel.border = element_blank(),
    panel.grid = element_blank(),
    axis.ticks = element_blank(),
    axis.text.x = element_blank()
  ) +
  geom_text(aes(y = Position, label = scales::percent(Prop)), size = 5, color = "white") +
  ggtitle("Pie chart of Titanic passengers by class")
```
```{r bar_chart, fig.cap='Notice how the use of color allows us to compare survivorship within classes.'}
class_survival <- titanic %>%
  group_by(Class, Survived) %>%
  summarize(Passengers = sum(Freq))
ggplot(class_survival, aes(y = Passengers, x = Class, fill = Survived)) +
  geom_bar(stat = "identity", position = "dodge") +
  ggtitle("Bar chart of survivorship on Titanic by class") +
  theme_minimal(14)
```
```{r table, results='asis'}
class_survival %>%
  mutate(Survived = if_else(Survived == "Yes", "Survived", "Did not survive")) %>%
  tidyr::spread(Survived, Passengers) %>%
  knitr::kable(format = "markdown")
```

In the past decade, a semi-alternative to the pie chart called the *waffle chart* (or "square pie chart") has gained popularity for representing relative sizes of groups. (See [Women in IT – Squaring the Pie?](https://eagereyes.org/techniques/square-pie-charts).) **Semi-alternative** because waffles compare **totals** while pie charts compare **percentages**. As such, waffle charts are good for comparing relative sizes, but not for comparing relative percentages.

Each square represents a certain number of units, which I think makes it easier to visually compare sizes of groups. For example, it is easier to compare 11 squares (2nd class passengers who survived) to 20 squares (1st class passengers who survived) than 1 pie slice to another pie slice that is 1.8 times bigger:

```{r waffle, fig.width=12, fig.height=6, out.width=624}
temp <- split(class_survival, class_survival$Survived) %>%
  purrr::map(~ set_names(.x$Passengers, .x$Class) / 10)
p1 <- class_survival %>%
  mutate(Survived = factor(if_else(Survived == "Yes", "Survived", "Did not survive"), c("Survived", "Did not survive"))) %>%
  group_by(Survived) %>%
  mutate(
    Prop = Passengers / sum(Passengers),
    Position = cumsum(Prop) - (Prop / 2)
  ) %>%
  ggplot(aes(x = factor(""), y = Prop, fill = Class)) +
  geom_bar(width = 1, stat = "identity", color = "black") +
  facet_wrap(~ Survived) +
  scale_fill_brewer(palette = "Set2") +
  coord_polar("y") +
  theme_minimal(14) +
  theme(
    axis.title.x = element_blank(),
    axis.title.y = element_blank(),
    panel.border = element_blank(),
    panel.grid = element_blank(),
    axis.ticks = element_blank(),
    axis.text.x = element_blank()
  ) +
  geom_text(aes(y = Position, label = scales::percent(Prop)), size = 4) +
  ggtitle("Titanic passengers by class")
p2 <- waffle::waffle(
  temp$Yes, rows = 5, size = 1,
  xlab = "1 square = 10 passengers",
  title = "Titanic passengers who survived"
)
p3 <- waffle::waffle(
  temp$No, rows = 5, size = 1,
  xlab = "1 square = 10 passengers",
  title = "Titanic passengers who did not survive"
)
top_row <- cowplot::plot_grid(p1, p2, rel_widths = c(1, 1.3))
cowplot::plot_grid(top_row, p3, ncol = 1)
```
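
The arithmetic behind the squares can be sketched with the built-in `Titanic` table. This assumes one square per 10 passengers with partial squares dropped, which matches the 11 and 20 squares quoted above; how the waffle package itself rounds is not verified here.

```r
# Survivors per class, summed over sex and age
survivors <- apply(Titanic, c(1, 4), sum)[, "Yes"]
survivors

# Assumption: one square per 10 passengers, partial squares dropped
squares <- floor(survivors / 10)
squares

# Ratio of square counts between 1st and 2nd class survivors
squares[["1st"]] / squares[["2nd"]]
```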

### Histograms and Densities

A [histogram](https://en.wikipedia.org/wiki/Histogram) shows the distribution of a continuous variable by splitting it into bins and counting how many observations fall into each bin (left). Sometimes those counts are divided by the total number of observations to yield proportions/probabilities instead (right). **Note** that the histogram on the right also includes a [probability density estimate](https://en.wikipedia.org/wiki/Density_estimation).

```{r histograms, fig.width=10, fig.height=5, out.width=624}
par(mfrow = c(1, 2), cex = 1.2)
hist(trees$Height, col = "gray40",
     main = "Height of Black Cherry Trees",
     xlab = "Height (ft)")
hist(trees$Height, col = "gray70",
     main = "Height of Black Cherry Trees",
     xlab = "Height (ft)", freq = FALSE, border = FALSE)
lines(density(trees$Height, adjust = 1), lwd = 2)
```
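
To see what a histogram is made of, the binning can be done without drawing anything. This sketch uses base R's `hist()` with `plot = FALSE` on the same `trees` data and relies on R's default choice of breaks:

```r
# Bin the heights without drawing the histogram
h <- hist(trees$Height, plot = FALSE)
h$breaks   # bin edges
h$counts   # observations per bin (what the left panel shows)
h$density  # counts / (n * bin width) (what the right panel shows)

# The bar areas of the density version add up to 1
total_area <- sum(h$density * diff(h$breaks))
total_area
```

Because the density version divides by the bin width as well as the number of observations, it is the bar *areas*, not the bar heights, that sum to 1.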

An important factor to watch out for is the bin size, which -- ideally -- was carefully chosen by the creator of the visualization. Bins that are too wide will hide the shape of the distribution, while bins that are too narrow will make the distribution appear too noisy:

```{r bin_sizes, fig.width=12, fig.height=4, out.width=624, fig.cap='The three little histo-bears.'}
par(mfrow = c(1, 3), cex = 1.2)
hist(trees$Height, col = "gray40", breaks = 3, xlab = "Height (ft)", main = "Too wide")
hist(trees$Height, col = "gray40", breaks = 6, xlab = "Height (ft)", main = "Just right")
hist(trees$Height, col = "gray40", breaks = 18, xlab = "Height (ft)", main = "Too narrow")
```

For a deeper look at histograms, I encourage you to check out [Exploring Histograms](https://tinlizzie.org/histograms/) by Aran Lunzer and Amelia McNamara.

### Comparing Distributions

The following plots are used for comparing distributions of a continuous variable (such as sepal length of Iris flowers) between different groups (such as different species):

```{r comparing_distributions, fig.width=10, fig.height=5, out.width=624}
p1 <- ggplot(data = iris, aes(x = Sepal.Length, fill = Species)) +
  geom_density(alpha = 0.5, adjust = 1.5) +
  scale_fill_brewer(palette = "Set1") +
  theme_minimal(14) +
  labs(x = "Sepal length (cm)", title = "Density plot")
p2 <- ggplot(data = iris, aes(y = Sepal.Length, x = Species)) +
  geom_violin(aes(fill = Species), adjust = 1.5) +
  scale_fill_brewer(palette = "Set1") +
  theme_minimal(14) +
  labs(y = "Sepal length (cm)", title = "Violin plot")
cowplot::plot_grid(p1, p2, nrow = 1)
```

The [density plot](https://en.wikipedia.org/wiki/Kernel_density_estimation) on the left is like a smooth histogram that doesn't discretize the variable into bins. The [violin plot](https://en.wikipedia.org/wiki/Violin_plot) on the right is a rotated version that makes it easier to perform the comparison because the densities (distributions) are not overlapping.
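
Both plots are built on a kernel density estimate. As a rough sketch, base R's `density()` computes one for a single species; the `adjust = 1.5` value mirrors the smoothing used in the charts above, and the area check is only a numerical approximation:

```r
# Kernel density estimate of sepal length for one species
setosa <- iris$Sepal.Length[iris$Species == "setosa"]
kde <- density(setosa, adjust = 1.5)  # larger `adjust` gives a smoother curve

# The curve is evaluated on an evenly spaced grid of x values
head(data.frame(x = kde$x, density = kde$y))

# Approximate area under the curve (should be close to 1)
area <- sum(kde$y) * diff(kde$x[1:2])
area
```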

An alternative called the *ridgeline plot* recently gained a lot of popularity for comparing distributions across groups because of how compact it is, which is especially useful when comparing many groups.

```{r ridgeline, eval=FALSE}
library(ggridges) # formerly ggjoy
ggplot(iris, aes(x = Sepal.Length, y = Species, fill = Species)) + 
  geom_density_ridges() +
  scale_x_continuous(expand = c(0.01, 0)) +
  scale_y_discrete(expand = c(0.01, 0)) +
  scale_fill_brewer(palette = "Set1") +
  theme_minimal(14) +
  labs(x = "Sepal length (cm)", title = "Ridgeline plot")
```

![Ridgeline plot of sepal length.](index_files/figure-html4/ridgeline.png)

A [box-and-whiskers chart](https://en.wikipedia.org/wiki/Box_plot) (also known as a *box plot*) allows you to visually compare the distributions by way of a [five number summary](https://en.wikipedia.org/wiki/Five-number_summary) which includes:

- Sample minimum (the smallest value)
- First [quartile](https://en.wikipedia.org/wiki/Quartile) (*Q<sub>1</sub>*) which is the 25th percentile
- Second quartile (*Q<sub>2</sub>*) also known as the [*median*](https://en.wikipedia.org/wiki/Median)
- Third quartile (*Q<sub>3</sub>*) which is the 75th percentile
- Sample maximum (the largest value)
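
These five numbers can be computed directly with base R's `quantile()`. This is a small sketch on the same `iris` data; note that `fivenum()` uses a slightly different definition of the quartiles (Tukey's hinges), so its results can differ a little on other data.

```r
# Five-number summary of sepal length across all three species
five <- quantile(iris$Sepal.Length, probs = c(0, 0.25, 0.5, 0.75, 1))
five

# The same summary, computed separately for each species
tapply(iris$Sepal.Length, iris$Species, quantile)
```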

```{r boxplot, fig.height=7, fig.width=7, out.width=624}
five_number_summary <- iris %>%
  group_by(1) %>%
  summarize(
    `Sample minimum` = min(Sepal.Length),
    `1st quartile (25th percentile)` = quantile(Sepal.Length, 0.25),
    `2nd quartile (median)` = median(Sepal.Length),
    `3rd quartile (75th percentile)` = quantile(Sepal.Length, 0.75),
    `Sample maximum` = max(Sepal.Length)
  ) %>%
  tidyr::gather(Summary, Sepal.Length, -1) %>%
  select(-1)
ggplot(data = iris, aes(y = Sepal.Length, x = 1)) +
  geom_boxplot(color = "gray50") +
  geom_point(data = five_number_summary, size = 9, shape = "←", position = position_nudge(x = 0.025)) +
  geom_label(
    data = five_number_summary,
    aes(label = Summary, hjust = "left"),
    nudge_x = 0.025, size = 5, label.padding = unit(0.5, "lines")
  ) +
  theme_minimal(14) +
  labs(y = "Sepal length (cm)", title = "Box plot", x = NULL) +
  theme(
    panel.grid.major.y = element_line(linetype = "dashed", color = "gray80"),
    panel.grid.minor.y = element_blank(),
    panel.grid.major.x = element_blank(),
    panel.grid.minor.x = element_blank(),
    axis.text.x = element_blank()
  )
```

My personal preference is to combine a violin plot and a box plot, so you still see the distribution in case there are multiple peaks ([*modes*](https://en.wikipedia.org/wiki/Mode_(statistics))) -- something you can't see with just a box-and-whiskers plot -- but you also see the summaries:

```{r violin_box_combined, fig.cap='Notice how the box plot hides the three modes.'}
temp <- iris %>%
  mutate(Species = "all 3") %>%
  rbind(iris)
ggplot(data = temp, aes(y = Sepal.Length, x = Species)) +
  geom_violin(fill = "gray80", color = NA, adjust = 0.5) +
  geom_boxplot(width = 0.1) +
  theme_minimal(14) +
  labs(y = "Sepal length (cm)", title = "Violin and box")
```

### Multiple variables

[Scatter plots](https://en.wikipedia.org/wiki/Scatter_plot) are the most popular and simplest way to investigate relationships between quantitative variables. You have one variable on the X axis and one variable on the Y axis. Each point represents a single unit from your dataset (e.g. a subject of an experiment):

```{r scatterplot, fig.cap='Shape and color of the points are determined by the species. Shapes are often used together with color to make the graphic better for colorblindness and grayscale printing.'}
ggplot(data = iris, aes(x = Petal.Length, y = Sepal.Length)) +
  geom_point(aes(color = Species, shape = Species)) +
  scale_color_brewer(palette = "Set1") +
  theme_minimal(14) +
  labs(
    x = "Petal length (cm)", y = "Sepal length (cm)",
    title = "Scatter plot of relationship between petal & sepal lengths"
  )
```

Data scientists and analysts often use *scatterplot matrices* to look at many different relationships between pairs of variables simultaneously:

```{r scatterplot_matrices, fig.height=10, fig.width=10, out.width=624, fig.cap='These are usually not present in final drafts of reports and are instead used as tools during the exploratory data analysis step.'}
panel.hist <- function(x, ...)
{
  # Copied from ?pairs
  usr <- par("usr"); on.exit(par(usr))
  par(usr = c(usr[1:2], 0, 1.5) )
  h <- hist(x, plot = FALSE, breaks = 20)
  breaks <- h$breaks; nB <- length(breaks)
  y <- h$counts; y <- y/max(y)
  rect(breaks[-nB], 0, breaks[-1], y, col = "gray40", border = "black")
}
panel.ellipse <- function(x, y, ...) {
  args <- list(...)
  tmp <- split(data.frame(x = x, y = y), args$col)
  for (color in names(tmp)) {
    points(
      tmp[[color]], pch = 16,
      col = adjustcolor(color, alpha.f = 0.25)
    )
    mixtools::ellipse(
      mu = colMeans(tmp[[color]]),
      sigma = cov(tmp[[color]]),
      alpha = 0.4, col = color,
      newplot = FALSE, draw = TRUE
    )
  }
}
pairs(
  iris[, 1:4],
  col = RColorBrewer::brewer.pal(3, "Set1")[as.numeric(iris$Species)],
  pch = 20, diag.panel = panel.hist, lower.panel = panel.ellipse,
  main = "Scatterplot matrix of Iris flower measurements"
)
```

At first glance there is **a lot** going on in that particular matrix, but really there are three main components that we can focus on one at a time:

1. the four panels along the *diagonal* have histograms of the individual variables,
2. the six panels in the *upper triangle* above the diagonal have basic scatter plots with points colored according to species for each pair of variables, and
3. the six panels in the *lower triangle* below the diagonal are also scatterplots, but with ellipses tracing the two-dimensional densities (assuming [Normality](https://en.wikipedia.org/wiki/Multivariate_normal_distribution)).

[Line charts](https://en.wikipedia.org/wiki/Line_chart) are the most common way to visualize [time series](https://en.wikipedia.org/wiki/Time_series) data, with time **usually** as the horizontal X axis and range of a quantitative variable as the vertical Y axis:

```{r pageviews, cache=TRUE}
enwiki_pvs <- pageviews::project_pageviews(end = Sys.Date() - 1)
frwiki_pvs <- pageviews::project_pageviews(project = "fr.wikipedia", end = Sys.Date() - 1)
pvs <- rbind(enwiki_pvs, frwiki_pvs)
pvs$language <- factor(pvs$language, c("en", "fr"), c("English", "French"))
```
```{r tsplot, fig.cap='You may notice that the linear scale and the difference in magnitude make it difficult to notice patterns for French Wikipedia. Perhaps this chart can be improved later in the workshop?'}
ggplot(pvs, aes(x = date, y = views / 1e6, color = language)) +
  geom_line() +
  scale_color_brewer("Language", palette = "Set1") +
  scale_x_datetime(date_breaks = "4 months", date_labels = "%b '%y") +
  labs(
    x = "Date", y = "Pageviews (millions)",
    title = "French & English Wikipedia daily pageviews",
    subtitle = "All platforms, all agent types"
  ) +
  theme_minimal(14)
```

## Other visualizations

### Mosaic plots

[Mosaic plots](https://en.wikipedia.org/wiki/Mosaic_plot) are used to visualize the relationships between two or more qualitative variables, and they are incredibly rare. While they are very useful once you learn how to read them, that step can be very difficult and so it is unsurprising that they don't show up more. They're often used by statisticians during exploratory data analysis to perform a visual check before performing a statistical test of independence.

We will use these to examine the distribution of hair and eye colors in ~600 statistics students at the University of Delaware, reported by Snee, R. D. in *The American Statistician* journal in 1974:

```{r mosaic1, fig.height=7, fig.width=9, out.width=624}
mosaicplot(
  t(margin.table(HairEyeColor, c(1, 3))),
  color = c("black", "brown", "red", "yellow"),
  main = "Mosaic Plot of Men and Women's Hair Colors"
)
rect(0.25, 0.5, 0.75, 0.7, col = "white", border = "black", lwd = 8)
rect(0.25, 0.5, 0.75, 0.7, col = "white", border = "green", lwd = 4)
text(0.5, 0.6, "Black hair color was more prevalent\nin men than women in this dataset.", cex = 1.1)
rect(0.34, 0, 0.66, 0.10, col = "white", border = "black", lwd = 8)
rect(0.34, 0, 0.66, 0.10, col = "white", border = "green", lwd = 4)
text(0.5, 0.05, "Opposite for blond hair.", cex = 1.1)
```

We can extend a mosaic plot to include *standardized residuals* (also called [*studentized residuals*](https://en.wikipedia.org/wiki/Studentized_residual)) from a [log-linear model](https://en.wikipedia.org/wiki/Log-linear_model). Cells representing <span style='color:red;font-weight:bold;'>negative residuals</span> -- meaning there are <span style='color:red;font-weight:bold;'>fewer observations than would have been expected under independence</span> -- are drawn as <span style='color:red;font-weight:bold;'>red</span> with broken borders; <span style='color:blue;font-weight:bold;'>positive residuals</span> -- meaning <span style='color:blue;font-weight:bold;'>more observations than would be expected</span> -- are drawn in <span style='color:blue;font-weight:bold;'>blue</span> with solid borders.

```{r mosaic2, fig.height=7, fig.width=9, out.width=624}
mosaicplot(
  margin.table(HairEyeColor, c(1, 2)), shade = TRUE,
  main = "Shaded Mosaic Plot of Hair and Eye Colors"
)
text(0.35, 0.79, "← Way more black-haired\npeople with brown eyes\nthan expected given\noverall proportions\nof brown-eyed and\nblack-haired people.", cex = 1.1)
highlight <- data.frame(
  x = c(0.00, 0.15, 0.15, 0.535, 0.535, 0.645, 0.645, 0.824, 0.824, 0.645, 0.645, 0.535, 0.535, 0.15, 0.15, 0.00, 0.00),
  y = c(0.025, 0.025, 0.075, 0.075, 0.165, 0.165, 0.095, 0.095, 0.19, 0.19, 0.37, 0.37, 0.27, 0.27, 0.16, 0.16, 0.025)
)
lines(x = highlight$x, y = highlight$y, col = "black", lwd = 8)
lines(x = highlight$x, y = highlight$y, col = "green", lwd = 4)
text(0.35, 0.175, "Fewer blond-haired people\nwith hazel-colored eyes\nthan we’d expect.", cex = 1.1)
# axis(4, at = seq(0, 1, 0.05))
# axis(1, at = seq(0, 1, 0.05))
# xy <- expand.grid(x = seq(0, 1, 0.1), y = seq(0, 1, 0.1)); xy$l <- sprintf("(%.1f, %.1f)", xy$x, xy$y)
# z <- xy[identify(xy$x, xy$y, xy$l), ]
```
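
The numbers behind the shading can be approximated with a chi-squared test of independence. This is a sketch under the assumption that the Pearson residuals from `chisq.test()` are a fair stand-in for the residuals that `mosaicplot()` shades by in this two-variable case:

```r
# Hair x Eye counts, summed over Sex
hair_eye <- margin.table(HairEyeColor, c(1, 2))

# Pearson residuals under independence: (observed - expected) / sqrt(expected)
fit <- chisq.test(hair_eye)
round(fit$residuals, 1)

fit$residuals["Black", "Brown"]  # positive: more people than expected
fit$residuals["Blond", "Hazel"]  # negative: fewer people than expected
```

Cells whose residuals are far from zero in either direction are the ones that get the strongest colors in a shaded mosaic plot.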

We can also look at the proportions across all three variables:

```{r mosaic3, fig.height=6, fig.width=8, out.width=624}
mosaicplot(
  HairEyeColor,
  main = "Mosaic Plot of Hair and Eye Colors in Women and Men"
)
highlight <- data.frame(
  x = c(0.00, 0.19, 0.19, 0.65, 0.65, 0.78, 0.78, 1.00, 1.00, 0.78, 0.78, 0.65, 0.65, 0.19, 0.19, 0.00, 0.00),
  y = c(0.19, 0.19, 0.28, 0.28, 0.38, 0.38, 0.20, 0.20, 0.92, 0.92, 0.63, 0.63, 0.585, 0.585, 0.40, 0.40, 0.19)
)
lines(x = highlight$x, y = highlight$y, col = "black", lwd = 8)
lines(x = highlight$x, y = highlight$y, col = "green", lwd = 4)
```

What the third mosaic plot tells us:

- Blond was the most prevalent hair color among those with blue eyes.
- More brown-haired men had blue eyes than brown-haired women.
- More blond-haired women had blue eyes than blond-haired men.

### Stacked area plots

A *stacked area plot* is a way to visualize changes in amounts (or proportions) over time.

```{r stacked_area, fig.height=6, fig.width=12, out.width=624, fig.cap='Beginning with 1925, the number of people over the age of 64 has increased dramatically, especially after 1975.'}
library(gcookbook) # install.packages("gcookbook")
p1 <- ggplot(uspopage, aes(y = Thousands, x = Year, fill = AgeGroup)) +
  geom_area(color = "black") +
  scale_y_continuous("Number of people in thousands", labels = compress) +
  scale_fill_discrete("Age group", breaks = rev(levels(uspopage$AgeGroup))) +
  ggtitle("Stacked areas", "Age distribution in the United States, 1900-2002") +
  theme_minimal(14) +
  guides(fill = guide_legend(reverse = TRUE))
p2 <- ggplot(uspopage, aes(y = Thousands, x = Year, fill = AgeGroup)) +
  geom_area(color = "black", position = "fill") +
  scale_y_continuous("Proportion", labels = scales::percent_format()) +
  scale_fill_discrete("Age group", breaks = rev(levels(uspopage$AgeGroup))) +
  ggtitle("Stacked proportions", "Age distribution in the United States, 1900-2002") +
  theme_minimal(14) +
  guides(fill = guide_legend(reverse = TRUE))
cowplot::plot_grid(p1, p2, nrow = 1)
```
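
The difference between the two panels is only a rescaling within each year. As a minimal sketch, the proportions can be computed by hand; the `pop` data frame below holds hypothetical counts, not values from the `uspopage` dataset:

```r
# Hypothetical counts (in thousands) for two years and three age groups
pop <- data.frame(
  Year = rep(c(1900, 2000), each = 3),
  AgeGroup = rep(c("<15", "15-64", ">64"), times = 2),
  Thousands = c(30, 60, 10, 50, 120, 30),
  stringsAsFactors = FALSE
)

# Share of each age group within its own year
pop$Proportion <- pop$Thousands / ave(pop$Thousands, pop$Year, FUN = sum)
pop
```

In this made-up example the oldest group triples in absolute size but its share only grows from 10% to 15%, which is the kind of distinction the two panels make visible.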

### Heat maps

[Heatmaps](https://en.wikipedia.org/wiki/Heat_map) are a graphical representation of matrices. For example, we can visualize a dataset of the top 50 NBA players' performance statistics from [the 2008-09 season](https://en.wikipedia.org/wiki/2008%E2%80%9309_NBA_season) (obtained from [RotoWire](https://www.rotowire.com/), formerly databaseBasketball):

```{r nba_data, cache=TRUE}
# https://learnr.wordpress.com/2010/01/26/ggplot2-quick-heatmap-plotting/
nba <- read.csv("http://datasets.flowingdata.com/ppg2008.csv")
nba$Name <- with(nba, reorder(Name, PTS))
colnames(nba) <- c(
  "Name", "Games", "Minutes", "Points",
  "Field goals made", "Field goal attempts", "Field goal %",
  "Free throws made", "Free throw attempts", "Free throw %",
  "Three-pointers made", "Three-point attempts", "Three-point %",
  "Offensive rebounds", "Defensive rebounds", "Total rebounds",
  "Assists", "Steals", "Blocks", "Turnovers", "PF"
)
nba$PF <- NULL; nba.m <- reshape2::melt(nba)
nba.m <- plyr::ddply(nba.m, plyr::.(variable), transform, rescale = scale(value))
nba.m$Name <- factor(nba.m$Name, levels(nba.m$Name)[order(levels(nba.m$Name))])
```
```{r nba_table, dependson='nba_data', eval=FALSE}
DT::datatable(
  nba,
  options = list(order = list(1, "desc")),
  class = "cell-border stripe",
  rownames = FALSE, filter = "top"
) %>%
  DT::formatPercentage(c("Field goal %", "Free throw %", "Three-point %"))
```
```{r heatmap, fig.height=10, fig.width=8, out.width=624, dependson='nba_data'}
ggplot(nba.m, aes(x = variable, y = Name, fill = rescale)) +
  geom_tile(color = "white") +
  viridis::scale_fill_viridis(
    "Compared to other 49 players",
    breaks = c(-2, 0, 2, 4),
    labels = function(x) {
      return(factor(x, c(-2, 0, 2, 4), c("Worse", "Average", "Better", "Way better")))
    }
  ) +
  theme_minimal(14) +
  labs(
    x = NULL, y = NULL,
    title = "Heatmap of top 50 NBA scorers' performance",
    subtitle = "Performance data from 2008-2009 season; centered and scaled",
    caption = "Source: FlowingData and RotoWire (formerly databaseBasketball)"
  ) +
  scale_x_discrete(expand = c(0, 1)) +
  scale_y_discrete(expand = c(0, 0), limits = rev(levels(nba.m$Name))) +
  theme(
    legend.position = "bottom", # "none",
    legend.key.width = unit(3,"line"),
    axis.ticks = element_blank(),
    axis.text.y = element_text(color = "black"),
    axis.text.x = element_text(size = 14 * 0.8, angle = 330, hjust = 0, color = "black"),
    plot.caption = element_text(size = 8)
  )
```

Some observations:

- [Dwight Howard](https://en.wikipedia.org/wiki/Dwight_Howard) was the best at blocking shots
- Dwight Howard was also one of the worst at making free throws
- [Yao Ming](https://en.wikipedia.org/wiki/Yao_Ming) was the best at making three-pointers (by % successful out of total attempts)
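
The heatmap colors each statistic after it has been centered and scaled, so that columns measured in very different units become comparable. A minimal sketch of that step, using a hypothetical `points` vector rather than the real NBA data:

```r
# Hypothetical points-per-game values for five players
points <- c(30.2, 28.4, 26.8, 25.9, 25.8)

# Centering and scaling: (x - mean(x)) / sd(x)
z <- as.vector(scale(points))
round(z, 2)

# The result has mean 0 and standard deviation 1
c(mean = mean(z), sd = sd(z))
```

A value of 0 therefore means "average among these players", and positive or negative values mean above or below average.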

```{r corr_map, fig.width=8, fig.height=6, out.width=624}
nba_correlations <- cor(nba[, -1])
abbreviations <- abbreviate(row.names(nba_correlations), minlength = 6)
rownames(nba_correlations) <- colnames(nba_correlations) <- abbreviations
correlation_plot <- GGally::ggcorr(data = NULL, cor_matrix = nba_correlations, nbreaks = 4, palette = "RdGy", label = TRUE, label_size = 3, label_color = "white", hjust = "right", size = 5, color = "grey50", layout.exp = 2) +
  ggtitle("Heatmap of correlations", "between performance statistics")
correlation_plot + theme_minimal(14)
```

Some observations:

- **Negative correlations** (in <span style='font-weight:bold;color:#CA0020;'>red</span>):
    - Players who made a higher percentage of field goals ("Fldgl.") stayed away from trying to make three-point shots ("Thr-pa" is "Three-point attempts").
- **Positive correlations** (in <span style='font-weight:bold;color:#404040;'>grey</span>):
    - Players who attempted/made more field goals ("Fldgla"/"Fldglm") also scored more points.
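
A correlation heatmap is just a colored version of a correlation matrix. As an illustration on a different, built-in dataset (the `iris` measurements rather than the NBA statistics), base R's `cor()` produces such a matrix:

```r
# Correlation matrix of the four iris measurements
iris_cor <- cor(iris[, 1:4])
round(iris_cor, 2)

# A strong positive correlation
iris_cor["Petal.Length", "Petal.Width"]

# A weak negative correlation
iris_cor["Sepal.Length", "Sepal.Width"]
```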

### Tree maps

[Treemapping](https://en.wikipedia.org/wiki/Treemapping) is a way to visualize hierarchical (nested) data as rectangles within other rectangles, with the area of the rectangle representing the proportion and sometimes a shade or color representing another variable. It is not dissimilar to a mosaic plot!

```{r treemap, fig.cap='Almost all of the crew was male and almost 80% of them died. Most of the 3rd class passengers did not make it either, while more than 85% of women in 1st and 2nd classes survived.'}
library(treemapify)
titanic %>%
  dplyr::mutate(Type = dplyr::case_when(
    Sex == "Female" & Age == "Child" ~ "Girls",
    Sex == "Male" & Age == "Child" ~ "Boys",
    Sex == "Female" & Age == "Adult" ~ "Women",
    Sex == "Male" & Age == "Adult" ~ "Men"
  )) %>%
  dplyr::group_by(Class, Type, Survived) %>%
  dplyr::summarize(Freq = sum(Freq)) %>%
  dplyr::summarize(Total = sum(Freq), Survival = Freq[Survived == "Yes"] / Total) %>%
  dplyr::ungroup() %>%
  dplyr::mutate(Label = sprintf("%s (%.1f%%)", Type, 100 * Survival)) %>%
  ggplot(aes(area = Total, fill = Survival, label = Label, subgroup = Class)) +
  geom_treemap() +
  geom_treemap_subgroup_border(color = "white") +
  geom_treemap_subgroup_text(
    place = "centre", grow = TRUE, alpha = 0.5,
    color = "black", fontface = "italic", min.size = 0
  ) +
  geom_treemap_text(colour = "white", place = "topleft", reflow = TRUE) +
  scale_fill_continuous(labels = scales::percent_format(), breaks = seq(0, 1, 0.25)) +
  labs(title = "Treemap of Titanic passengers' survival rates") +
  theme_minimal(14)
```
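
The two quantities a treemap like this encodes, a total for the area and a rate for the color, can be sketched at the class level only (ignoring sex and age) from the built-in `Titanic` table:

```r
# Passengers by class and survival, summed over sex and age
by_class <- apply(Titanic, c(1, 4), sum)

total <- rowSums(by_class)             # would drive the rectangle area
survival <- by_class[, "Yes"] / total  # would drive the fill color

round(cbind(Total = total, Survival = survival), 3)
```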

### Choropleths

[Choropleths](https://en.wikipedia.org/wiki/Choropleth_map) are geographical maps that are colored and/or shaded according to some variable such as population density.

```{r choropleth}
data("USArrests", package = "datasets")
data("fifty_states", package = "fiftystater")
library(fiftystater); library(mapproj)
crimes <- data.frame(state = tolower(rownames(USArrests)), USArrests) %>%
  dplyr::left_join(data.frame(abb = state.abb, state = tolower(state.name)), by = "state")
centroids <- fifty_states[, c("long", "lat")] %>%
  split(fifty_states$id) %>%
  lapply(as.matrix) %>%
  lapply(geosphere::centroid) %>%
  lapply(as.data.frame) %>%
  dplyr::bind_rows(.id = "state")
abbrvs <- crimes[, c("state", "abb")] %>%
  dplyr::left_join(centroids, by = "state")
ggplot(crimes, aes(map_id = state)) + 
  geom_map(aes(fill = Murder), map = fifty_states, color = "white") + 
  geom_text(data = abbrvs, aes(label = abb, x = lon, y = lat)) +
  expand_limits(x = fifty_states$long, y = fifty_states$lat) +
  coord_map() +
  scale_fill_distiller(palette = "RdGy") +
  scale_x_continuous(breaks = NULL) + 
  scale_y_continuous(breaks = NULL) +
  labs(
    x = NULL, y = NULL, fill = "Murder arrests",
    title = "Choropleth of 1973 crime rates by US state",
    subtitle = "Arrest rates are per 100,000 residents",
    caption = "Source: World Almanac and Book of Facts, 1975"
  ) +
  theme_minimal(14) +
  theme(
    panel.background = element_blank()
  )
```

### Networks and graphs

[Network diagrams](https://en.wikipedia.org/wiki/Graph_drawing) are for visualizing [graphs](https://en.wikipedia.org/wiki/Graph_(discrete_mathematics)) (from [graph theory](https://en.wikipedia.org/wiki/Graph_theory)) and networks (from [network theory](https://en.wikipedia.org/wiki/Network_theory)) where there are [*nodes*](https://en.wikipedia.org/wiki/Node_(computer_science)) ([*vertices*](https://en.wikipedia.org/wiki/Vertex_(graph_theory))) connected by *links* ([*edges*](https://en.wikipedia.org/wiki/Edge_(graph_theory))). Their goal is to visually represent relationships between units. For example, using [the Wikipedia Clickstream data](https://meta.wikimedia.org/wiki/Research:Wikipedia_clickstream) from November 2017 we can start at the article on net neutrality and visualize a *neighborhood* of articles that are *adjacent* to the central one:

<a title="By MPopov (WMF) (Own work) [CC BY-SA 4.0 (https://creativecommons.org/licenses/by-sa/4.0)], via Wikimedia Commons" href="https://commons.wikimedia.org/wiki/File%3ANet_neutrality_clickstream_(Nov_2017).png"><img width="512" alt="Net neutrality clickstream (Nov 2017)" src="https://upload.wikimedia.org/wikipedia/commons/thumb/7/73/Net_neutrality_clickstream_%28Nov_2017%29.png/512px-Net_neutrality_clickstream_%28Nov_2017%29.png"/></a>

The darkness of the edges connecting the vertices represents how many clicks there were between each pair of articles. We can see that there are more clicks between "net neutrality" and "digital rights" than between "net neutrality" and "human rights", and far more still between "net neutrality" and "Wikipedia Zero".
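
As a rough sketch of what sits behind a diagram like this, the clickstream can be thought of as an *edge list*: one row per pair of articles, with a click count. The data frame below is entirely made up for illustration (the counts are not real clickstream numbers, and `clicks`, `touching`, `neighbors`, and `shade` are hypothetical names). It shows how the neighborhood of a central article and the relative darkness of each edge could be derived:

```r
# Toy edge list; the counts are invented for illustration only
clicks <- data.frame(
  from = c("Net neutrality", "Net neutrality", "Net neutrality", "Digital rights"),
  to   = c("Wikipedia Zero", "Digital rights", "Human rights", "Human rights"),
  n    = c(900, 120, 40, 15),
  stringsAsFactors = FALSE
)

center <- "Net neutrality"

# Edges that touch the central article, in either direction
touching <- clicks[clicks$from == center | clicks$to == center, ]

# The neighborhood: every article adjacent to the central one
neighbors <- setdiff(unique(c(touching$from, touching$to)), center)

# Scale click counts to (0, 1] so that more clicks means a darker edge
touching$shade <- touching$n / max(touching$n)

print(neighbors)
print(touching[order(-touching$shade), c("from", "to", "shade")])
```

The edge between "Digital rights" and "Human rights" is dropped here because it does not touch the central article; a real network diagram may still draw such edges between neighbors.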

## Scales and transformed data

Sometimes the author of a visualization has chosen to apply a [transformation to the data](https://en.wikipedia.org/wiki/Data_transformation_(statistics)), usually because the data is [skewed](https://en.wikipedia.org/wiki/Skewness). It is important to watch out for transformed axes, especially [logarithmic scales](https://en.wikipedia.org/wiki/Logarithmic_scale), because equal distances along such an axis no longer represent equal differences in the data.

```{r transformed_scatterplot, fig.width=12, fig.height=12, out.width=624, out.height=624}
set.seed(0)
x <- sample(1:100, 500, replace = TRUE)
slope <- 0.01; intercept <- runif(1, 0, 1); err.std <- runif(1, 0.01, 0.1)
noise <- rnorm(length(x), 0, sd = err.std)
y <- intercept + slope * x + noise
df <- data.frame(x = x, y = 10 ^ y)
p1 <- ggplot(df, aes(x = y)) + geom_histogram(aes(y = ..density..), fill = "gray70") + geom_density(adjust = 2) + theme_minimal(14) + ggtitle("Distribution of y is right skewed")
p2 <- ggplot(df, aes(x = log10(y))) + geom_histogram(aes(y = ..density..), fill = "gray70") + geom_density(adjust = 2) + theme_minimal(14) + ggtitle("Transformed y is more symmetric")
p3 <- ggplot(df, aes(x = x, y = y)) + geom_point() + ggtitle("Not a linear relationship between x and y") + theme_minimal(14) + geom_smooth(se = FALSE, method = "lm", size = 2)
p4 <- ggplot(df, aes(x = x, y = log10(y))) + geom_point() + ggtitle("Linear relationship between x and transformed y") + theme_minimal(14) + geom_smooth(se = FALSE, method = "lm", size = 2)
cowplot::plot_grid(plotlist = list(p1, p2, p3, p4))
```
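
To see why a logarithmic scale straightens things out, it can help to look at the numbers directly. The sketch below uses a handful of illustrative values (not taken from any dataset in this workshop). On a log10 scale, each step of 1 corresponds to multiplying the original value by 10, so equal distances on the axis mean equal *ratios* rather than equal *differences*:

```r
# Illustrative values spanning several orders of magnitude
y <- c(1, 10, 100, 1000, 10000)

# Differences on the original scale grow quickly...
diff(y)
# ...while differences on the log10 scale are constant
diff(log10(y))

# Reading a value back off a log10 axis: undo the transformation.
# A point halfway between the 100 and 1000 ticks is about 316, not 550.
10 ^ 2.5
```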

Let us revisit the pageviews data from earlier, this time using a logarithmic axis:

```{r tsplot_log10, fig.cap='Notice how the French Wikipedia pageviews are no longer dwarfed by the magnitude of the English Wikipedia pageviews.'}
ggplot(pvs, aes(x = date, y = views / 1e6, color = language)) +
  geom_line() +
  scale_y_log10("Pageviews (millions)") +
  scale_color_brewer("Language", palette = "Set1") +
  scale_x_datetime(date_breaks = "4 months", date_labels = "%b '%y") +
  labs(
    x = "Date", title = "French & English Wikipedia daily pageviews",
    subtitle = "All platforms, all agent types"
  ) +
  theme_minimal(14)
```

It is possible (but rare) to encounter logarithmically scaled time axes, which are helpful when you have long tails caused by outliers:

```{r transformed_histogram, fig.width=12, fig.height=6, out.width=624}
logtime_breaks <- c(1, 5, 30, 60, 60*5, 60*10, 60*30, 60*60, 60*60*24)
logtime_labels <- function(breaks) {
  lbls <- breaks %>%
    round %>%
    lubridate::seconds_to_period() %>%
    tolower %>%
    gsub(" ", "", .) %>%
    sub("(.*[a-z])0s$", "\\1", .) %>%
    sub("(.*[a-z])0m$", "\\1", .) %>%
    sub("(.*[a-z])0h$", "\\1", .)
  return(lbls)
}
scale_x_logtime <- function(...) {
  scale_x_log10(..., breaks = logtime_breaks, labels = logtime_labels)
}
set.seed(0)
sessions <- data.frame(length = 10 ^ abs(rnorm(100, 0, 2)))
p1 <- ggplot(sessions, aes(x = length)) +
  geom_histogram() +
  scale_x_continuous(
    name = "Session length",
    breaks = 3600 * seq(0, 24, 4),
    labels = logtime_labels
  ) +
  ggtitle("Histogram of session length", "Not very useful") +
  theme_minimal(14)
p2 <- ggplot(sessions, aes(x = length)) +
  geom_histogram() +
  scale_x_logtime(name = "Session length") +
  ggtitle("Logarithmically scaled time axis", "Substantially more useful") +
  theme_minimal(14)
cowplot::plot_grid(p1, p2)
```
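
One way to see why the left-hand histogram is "not very useful" is to compare a few summary statistics of the same simulated session lengths. This sketch regenerates the simulated data with the same seed and formula as the figure; the comparisons themselves (mean versus median, share of sessions under one hour) and the name `session_length` are additions for illustration:

```r
set.seed(0)
# Same simulation as in the figure: session lengths in seconds, right-skewed
session_length <- 10 ^ abs(rnorm(100, 0, 2))

# A few extreme sessions pull the mean far above the median
mean(session_length)
median(session_length)

# Share of sessions shorter than one hour. On a linear axis whose range is
# set by the longest session, these are squeezed into a sliver on the left.
mean(session_length < 3600)

# After a log10 transformation the values are spread far more evenly
summary(log10(session_length))
```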

## Group activity

Pair up with someone sitting next to you and pick one of the following 3 visualizations. You and your partner(s) should agree on the same one.

1. **This part is done individually** (3 minutes)
    - Note 2-3 interesting observations.
    - **Reminder:**
        - Once you've identified the variables involved, you are looking for relationships between them.
        - You're also looking for patterns and outliers.
2. **This part is done with your partner(s)** (2-3 minutes)
    - Share your insights with your partner(s).
    - Check if they agree with your observations.
    - If they didn't notice the same things as you, explain how you arrived at your interpretation of the chart.

A different take on the Titanic data:

```{r plot1, fig.width=8, fig.height=12, out.width=624}
par(mfrow = c(2, 1))
mosaicplot(~ Sex + Age + Survived, data = Titanic, shade = TRUE, main = "Plot 1a: Titanic passenger survivorship", cex.axis = 0.8)
mosaicplot(~ Class + Sex + Survived, data = Titanic, shade = TRUE, main = "Plot 1b: Titanic passenger survivorship", cex.axis = 0.8)
par(mfrow = c(1, 1))
```

A different take on the violent crime rates data:

```{r plot2, fig.width=10, fig.height=5, out.width=624}
waffle::iron(
  waffle::waffle(
    # unlist() turns the one-row data frame into a named numeric vector
    unlist(USArrests["California", c("Murder", "Assault", "Rape")]),
    xlab = "1 square = 1 arrest per 100,000 residents",
    title = "Plot 2a: Violent crimes in California in 1973",
    rows = 10
  ),
  waffle::waffle(
    unlist(USArrests["Pennsylvania", c("Murder", "Assault", "Rape")]),
    xlab = "1 square = 1 arrest per 100,000 residents",
    title = "Plot 2b: Violent crimes in Pennsylvania in 1973",
    rows = 4
  )
)
```

A different take on the Wikipedia pageviews data:

```{r plot3, dependson='pageviews', fig.width=12, fig.height=8, out.width=624}
seasons <- data.frame(
  month = c(3, 6, 9, 12),
  day = c(21, 21, 23, 21),
  name = c("Spring Equinox", "Summer Solstice", "Autumnal Equinox", "Winter Solstice"),
  starts = c("Spring", "Summer", "Autumn", "Winter"),
  stringsAsFactors = FALSE
)
to_season <- function(d) {
  d.year <- lubridate::year(d)
  markers <- as.Date(paste(seasons$month, seasons$day, d.year, sep = "-"), format = "%m-%d-%Y")
  lgcls <- markers <= d
  if (all(!lgcls)) {
    return("Winter")
  } else {
    return(seasons$starts[max(which(lgcls))])
  }
}
pvs$season <- purrr::map_chr(pvs$date, to_season)
pvs %>%
  dplyr::mutate(wday = lubridate::wday(date, label = TRUE, abbr = TRUE)) %>%
  dplyr::filter(language == "English" | views < 40e6) %>%
  ggplot(aes(x = wday, y = views)) +
  geom_violin(fill = "gray80", color = NA, adjust = 0.75) +
  geom_boxplot(width = 1) +
  scale_y_continuous(labels = compress) +
  facet_grid(language ~ season, scales = "free_y") +
  labs(
    x = "Day of week", y = "Pageviews",
    title = "Plot 3: English and French Wikipedia pageviews",
    subtitle = "By day of week and season"
  ) +
  theme_bw(15)
```

## Assessment

Some questions to verify that you understand the core concepts in data visualization:

```{r quiz}
quiz(
  question(
    "Which of these are examples of a discrete quantitative variable?",
    answer("Time spent reading a Wikipedia article"),
    answer("Number of articles edited by user", correct = TRUE),
    answer("User session ID"),
    answer("Percent change in monthly pageviews from previous month"),
    answer("Total pageviews last month", correct = TRUE),
    random_answer_order = TRUE, type = "multiple",
    incorrect = "Time spent and % change are continuous and user session ID is a qualitative variable."
  ),
  question(
    "A log transformation of the data can help when a variable has positive skew (a long tail on the right)",
    answer("True", correct = TRUE),
    answer("False")
  ),
  question(
    "Which basic chart is the best type for showing a relationship between two continuous quantitative variables?",
    answer("pie"),
    answer("bar"),
    answer("scatter", correct = TRUE),
    answer("histogram"),
    answer("box and whiskers"),
    answer("mosaic"),
    random_answer_order = TRUE
  ),
  question(
    "An effective data visualization will include some or all of the following:",
    answer("Distribution of a variable", correct = TRUE),
    answer("Relationship(s) between variables", correct = TRUE),
    answer("Labels", correct = TRUE),
    answer("Fancy fonts"),
    answer("Aesthetically pleasing colors"),
    answer("Interactivity"),
    random_answer_order = TRUE, type = "multiple",
    incorrect = "Interactivity CAN make a data visualization more effective, but we have seen static (non-interactive) examples so far that are completely fine without it. When it comes to colors, the most important factor is whether they convey the information well, and having a color scheme that is aesthetically pleasing is really nice, but is not technically necessary. Fancy fonts can look good, but they don't add to the story the visualization is meant to tell."
  ),
  question(
    "A heatmap is a choropleth",
    answer("True"),
    answer("False", correct = TRUE),
    incorrect = "Choropleth maps include geographical boundaries, which heatmaps do not."
  ),
  question(
    "Which of the following are examples of a qualitative/categorical variable? (This includes nominal and ordinal types.)",
    answer("Survey respondent's age group \"65 and older\"", correct = TRUE),
    answer("Survey respondent's age"),
    answer("Rating scale (e.g. \"worst\" to \"best\")", correct = TRUE),
    answer("Survey respondent's gender", correct = TRUE),
    answer("How much time user spent responding to survey"),
    random_answer_order = TRUE, type = "multiple",
    incorrect = "Age is quantitative while an age group is qualitative. Time spent is quantitative."
  )
)
```
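
The variable types in the questions above map fairly directly onto how R stores data, which can be a handy way to remember them. The sketch below uses made-up survey values; the variable names and the age-group cut points are illustrative assumptions:

```r
# Made-up survey responses, for illustration only
age <- c(23, 67, 45)                               # quantitative
articles_edited <- c(0L, 12L, 3L)                  # quantitative, discrete (counts)
gender <- factor(c("woman", "man", "non-binary"))  # qualitative, nominal

# Binning a quantitative variable produces a qualitative, ordinal one
age_group <- cut(
  age,
  breaks = c(0, 17, 64, Inf),
  labels = c("under 18", "18-64", "65 and older"),
  ordered_result = TRUE
)

is.numeric(age)              # TRUE
is.integer(articles_edited)  # TRUE
is.ordered(gender)           # FALSE: categories with no inherent order
is.ordered(age_group)        # TRUE: categories with an order
as.character(age_group)      # "18-64" "65 and older" "18-64"
```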

## Appendix

### Visual essays

- [A visual introduction to machine learning](http://www.r2d3.us/visual-intro-to-machine-learning-part-1/) by Stephanie Yee and Tony Chu
- [Exploring Histograms](https://tinlizzie.org/histograms/) by Aran Lunzer and Amelia McNamara
- [Algorithms Tour](http://algorithms-tour.stitchfix.com/): How data science is woven into the fabric of Stitch Fix
- [An Interactive Visualization of Every Line in Hamilton](https://pudding.cool/2017/03/hamilton/index.html) by Shirley Wu
- [Constructed Career Paths from Job Switching Data](http://flowingdata.com/2017/11/28/career-paths/) by Nathan Yau

### Collections

- [The New York Times Graphics Department](https://twitter.com/nytgraphics)
- [Information is Beautiful Awards](https://www.informationisbeautifulawards.com/)
- [FlowingData](http://flowingdata.com/)'s [10 Best Data Visualization Projects of 2017](https://flowingdata.com/2017/12/28/10-best-data-visualization-projects-of-2017/)

### Further reading

- [The Visual Display of Quantitative Information](https://www.edwardtufte.com/tufte/books_vdqi) by Edward Tufte
- [Handbook of Data Visualization](http://www.springer.com/us/book/9783540330363) (Editors: Chun-houh Chen, Wolfgang Karl Härdle, Antony Unwin)

### Making your own

- [Visualize This: The FlowingData Guide to Design, Visualization, and Statistics](http://book.flowingdata.com/) by Nathan Yau
- [R Graphics Cookbook](http://shop.oreilly.com/product/0636920063704.do) by Winston Chang
- [ggplot2: Elegant Graphics for Data Analysis](http://ggplot2.org/book/) by Hadley Wickham
- [Data Visualization with Python and JavaScript](http://shop.oreilly.com/product/0636920037057.do) by Kyran Dale
- [SVG Animations: From Common UX Implementations to Complex Responsive Animation](http://shop.oreilly.com/product/0636920045335.do) by Sarah Drasner
- [D3.js in Action](https://www.manning.com/books/d3js-in-action-second-edition) by Elijah Meeks