---
title: "Plotting the Course Through Charted Waters"
output:
  learnr::tutorial:
    theme: "cosmo"
tutorial:
  id: "org.wikimedia.mikhail.dataviz-literacy"
  version: 0.9.6
runtime: shiny_prerendered
---
```{r setup, include=FALSE}
library(learnr)
knitr::opts_chunk$set(echo = FALSE, warning = FALSE, message = FALSE)
```
```{r data}
library(magrittr)
library(dplyr)
library(ggplot2)
titanic <- data.frame(Titanic)
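# Abbreviate large numbers for axis labels, e.g. 1500 -> "1.5K", 2e6 -> "2M"
compress <- function(x, round_by = 2) {
  # Magnitude bucket: 0 = below 1, 1 = ones, 2 = thousands, 3 = millions, ...
  div <- findInterval(x, c(1, 1e3, 1e6, 1e9, 1e12))
  # Divide by the matching power of 1000, round, and append the unit suffix
  return(paste0(round( x / 10 ^ (3 * ifelse(div - 1 < 0, 0, div - 1)), round_by),
                c("", "", "K", "M", "B", "T")[div + 1]))
}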
```
## Introduction
Heat maps, stacked area plots, mosaic plots, choropleths -- oh my! There are so many different ways to visually convey relationships and patterns in data! In this workshop on data visualization literacy, you'll learn to recognize many popular types of charts and how to glean insights from them. The **Appendix** contains some examples of data visualization as visual essays and it also includes links to resources for learning how to create your own.
This workshop is [available as open source](https://github.com/bearloga/wmf-allhands18). There is an [interactive version](http://dataviz-literacy.wmflabs.org/) (which should automatically send you to either [mirror 1](http://dataviz-lit-01.wmflabs.org/), [mirror 2](http://dataviz-lit-02.wmflabs.org/), or [mirror 3](http://dataviz-lit-03.wmflabs.org/)) and a [static version](https://bearloga.github.io/wmf-allhands18/).
| | Contact Information |
|-------------:|:------------------------------------------|
| **Work** | mikhail at wikimedia dot org |
| **Personal** | mikhail at mpopov dot com |
| **IRC** | bearloga in #wikimedia-discovery, etc. |
| **Twitter** | [bearloga](https://twitter.com/bearloga) |
<!-- Piwik -->
<script type="text/javascript">
var _paq = _paq || [];
_paq.push(["setDomains", ["*.dataviz-literacy.wmflabs.org"]]);
_paq.push(['trackPageView']);
_paq.push(['enableLinkTracking']);
(function() {
var u="//piwik.wikimedia.org/";
_paq.push(['setTrackerUrl', u+'piwik.php']);
_paq.push(['setSiteId', '15']);
var d=document, g=d.createElement('script'), s=d.getElementsByTagName('script')[0];
g.type='text/javascript'; g.async=true; g.defer=true; g.src=u+'piwik.js'; s.parentNode.insertBefore(g,s);
})();
</script>
<noscript><p><img src="//piwik.wikimedia.org/piwik.php?idsite=15" style="border:0;" alt="" /></p></noscript>
<!-- End Piwik Code -->
## Terms and basics
### Data visualization as storytelling
Graphical displays should:
- Show the data
- Induce the viewer to think about the substance rather than graphic design or format
- Avoid distorting the data
- Present many numbers in a small space
- Make large data sets coherent
- Encourage the eye to compare different pieces of data
-- Edward R. Tufte, *The Visual Display of Quantitative Information*
### Types of variables
- *Quantitative* variables have a numeric value and/or an ordering
    - *Continuous* variables have an infinite range of possible values
        - **Examples**: time, age, weight, lengths (height, distance, time spent online), drug dosage
    - Quantitative variables that have limited possible values are *discrete*
        - **Examples**: population size, number of times an event occurred, pageviews, number of questions a student got correct on a test
    - Continuous variables are sometimes discretized by rounding if precision is not necessary
- [*Categorical*](https://en.wikipedia.org/wiki/Categorical_variable) / *discrete* / *qualitative* variables have a limited number of possible values:
    - *Nominal* variables have two or more categories that do not have an intrinsic order
        - **Examples**: gender, ethnicity, controls vs test group, operating system
    - *Ordinal* variables are like nominal, but the categories have an ordering/ranking such as the [Likert rating scale](https://en.wikipedia.org/wiki/Likert_scale)
    - Categorical variables can also be created from quantitative variables
        - **Example**: survey takers are often combined into age groups such as "18-24"
Refer to [levels of measurement](https://en.wikipedia.org/wiki/Level_of_measurement) for more information.
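As a small illustration of turning a quantitative variable into a categorical one, the sketch below bins ages into ordered groups with base R's `cut()`. The `ages` vector and the group boundaries are made-up values chosen for illustration; they are not data from this workshop.

```r
# Hypothetical ages of seven survey takers (illustrative values only)
ages <- c(18, 23, 25, 34, 35, 51, 67)

# Left-closed bins: [18, 25), [25, 35), ..., [65, Inf)
age_groups <- cut(
  ages,
  breaks = c(18, 25, 35, 45, 55, 65, Inf),
  labels = c("18-24", "25-34", "35-44", "45-54", "55-64", "65+"),
  right = FALSE,
  ordered_result = TRUE
)

# Count how many survey takers fall into each age group
table(age_groups)
```

The result is an ordinal variable: the groups keep their order, but the exact ages can no longer be recovered from them.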
### Things to look for
- Title (most plots should have this)
- Axis labels (almost all plots should have this)
- How many variables and their types
    - Including ones used to dictate colors, shapes, patterns, sizes, opacities, etc.
- Independent ("*predictor*") variables (e.g. time) are usually on the X (horizontal) axis
    - Occasionally time is plotted on the vertical axis for specific reasons
- Dependent ("*outcome*" / "*response*") variables are usually on the Y (vertical) axis
- Scales (especially log-transformed ones)
## Common visualizations
### Pies, Waffles, Bars, and Tables
A [pie chart](https://en.wikipedia.org/wiki/Pie_chart) and a [bar chart](https://en.wikipedia.org/wiki/Bar_chart) (sometimes called a *bar plot*) are easy ways to visually compare values, and a plain table is often a good option too. The pie chart -- where the slices represent proportions of the whole -- is excellent for 2-4 categories, the table is great for 1-8 categories, and the bars' heights work well for comparing more than 5 categories.
```{r pie_chart}
per_class <- titanic %>%
group_by(Class) %>%
summarize(Total = sum(Freq)) %>%
mutate(
Class = factor(Class, c("Crew", "3rd", "2nd", "1st")),
Prop = Total / sum(Total),
Position = cumsum(Prop) - (Prop / 2)
)
ggplot(per_class, aes(x = factor(""), y = Prop, fill = Class)) +
geom_bar(width = 1, stat = "identity", color = "black") +
scale_fill_brewer(palette = "Set1") +
coord_polar("y") +
theme_minimal(14) +
theme(
axis.title.x = element_blank(),
axis.title.y = element_blank(),
panel.border = element_blank(),
panel.grid = element_blank(),
axis.ticks = element_blank(),
axis.text.x = element_blank()
) +
geom_text(aes(y = Position, label = scales::percent(Prop)), size = 5, color = "white") +
ggtitle("Pie chart of Titanic passengers by class")
```
```{r bar_chart, fig.cap='Notice how the use of color allows us to compare survivorship within classes.'}
class_survival <- titanic %>%
group_by(Class, Survived) %>%
summarize(Passengers = sum(Freq))
ggplot(class_survival, aes(y = Passengers, x = Class, fill = Survived)) +
geom_bar(stat = "identity", position = "dodge") +
ggtitle("Bar chart of survivorship on Titanic by class") +
theme_minimal(14)
```
```{r table, results='asis'}
class_survival %>%
mutate(Survived = if_else(Survived == "Yes", "Survived", "Did not survive")) %>%
tidyr::spread(Survived, Passengers) %>%
knitr::kable(format = "markdown")
```
In the past decade, a semi-alternative to the pie chart called the *waffle chart* (or "square pie chart") has gained popularity for representing relative sizes between groups. (See [Women in IT – Squaring the Pie?](https://eagereyes.org/techniques/square-pie-charts).) **Semi-alternative** because waffles compare **totals** and pie charts compare **percentages**. As such, waffle charts are good for comparing relative sizes, but not for comparing relative %s.
Each square represents a certain number of units, which I think makes it easier to visually compare sizes of groups. For example, it is easier to compare 11 squares (2nd class passengers who survived) to 20 squares (1st class passengers who survived) than 1 pie slice to another pie slice that is 1.8 times bigger:
```{r waffle, fig.width=12, fig.height=6, out.width=624}
temp <- split(class_survival, class_survival$Survived) %>%
purrr::map(~ set_names(.x$Passengers, .x$Class) / 10)
p1 <- class_survival %>%
mutate(Survived = factor(if_else(Survived == "Yes", "Survived", "Did not survive"), c("Survived", "Did not survive"))) %>%
group_by(Survived) %>%
mutate(
Prop = Passengers / sum(Passengers),
Position = cumsum(Prop) - (Prop / 2)
) %>%
ggplot(aes(x = factor(""), y = Prop, fill = Class)) +
geom_bar(width = 1, stat = "identity", color = "black") +
facet_wrap(~ Survived) +
scale_fill_brewer(palette = "Set2") +
coord_polar("y") +
theme_minimal(14) +
theme(
axis.title.x = element_blank(),
axis.title.y = element_blank(),
panel.border = element_blank(),
panel.grid = element_blank(),
axis.ticks = element_blank(),
axis.text.x = element_blank()
) +
geom_text(aes(y = Position, label = scales::percent(Prop)), size = 4) +
ggtitle("Titanic passengers by class")
p2 <- waffle::waffle(
temp$Yes, rows = 5, size = 1,
xlab = "1 square = 10 passengers",
title = "Titanic passengers who survived"
)
p3 <- waffle::waffle(
temp$No, rows = 5, size = 1,
xlab = "1 square = 10 passengers",
title = "Titanic passengers who did not survive"
)
top_row <- cowplot::plot_grid(p1, p2, rel_widths = c(1, 1.3))
cowplot::plot_grid(top_row, p3, ncol = 1)
```
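To make the totals-versus-percentages distinction concrete, here is a minimal sketch that derives both views from the `Titanic` dataset that ships with base R. It assumes partial squares are dropped (rounded down), which is an assumption about how the waffle is drawn, although it does reproduce the 11 and 20 squares quoted above.

```r
# Titanic ships with base R (datasets package)
titanic <- as.data.frame(Titanic)
survivors <- titanic[titanic$Survived == "Yes", ]

# Total survivors per class, as a plain named vector
survived_per_class <- sapply(split(survivors$Freq, survivors$Class), sum)

# Waffle chart view: whole squares at 10 passengers per square
# (assumption: partial squares are rounded down)
squares <- floor(survived_per_class / 10)

# Pie chart view: each class as a share of all survivors
shares <- survived_per_class / sum(survived_per_class)

squares
round(shares, 3)
```

The squares preserve the group sizes up to the unit size, while the shares only say how the whole is divided.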
### Histograms and Densities
A [histogram](https://en.wikipedia.org/wiki/Histogram) shows the distribution of a continuous variable by splitting it into bins and counting how many observations fall into each bin (left). Sometimes those counts are divided by the total number of observations to yield proportions/probabilities instead (right). **Note** that the histogram on the right also includes a [probability density estimate](https://en.wikipedia.org/wiki/Density_estimation).
```{r histograms, fig.width=10, fig.height=5, out.width=624}
par(mfrow = c(1, 2), cex = 1.2)
hist(trees$Height, col = "gray40",
main = "Height of Black Cherry Trees",
xlab = "Height (ft)")
hist(trees$Height, col = "gray70",
main = "Height of Black Cherry Trees",
xlab = "Height (ft)", freq = FALSE, border = FALSE)
lines(density(trees$Height, adj = 1), lwd = 2)
```
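The relationship between counts, proportions, and the density scale can be checked by hand. The sketch below uses the same built-in `trees` data and base R's `hist()` with `plot = FALSE`; the variable names are my own. Note that with `freq = FALSE` the bar heights are proportions divided by the bin width, so that the bar areas (not heights) sum to 1.

```r
# Compute the histogram without drawing it
h <- hist(trees$Height, plot = FALSE)

# Number of observations per bin
counts <- h$counts

# Share of observations per bin (sums to 1)
proportions <- counts / sum(counts)

# Density scale: proportion per unit of the x axis
bin_widths <- diff(h$breaks)
densities <- proportions / bin_widths

# Compare the hand-computed densities with what hist() reports
all.equal(densities, h$density)
```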
An important factor to watch out for is the bin size, which -- ideally -- was carefully chosen by the creator of the visualization. Bins that are too wide hide the shape of the distribution, while bins that are too narrow make the distribution appear too noisy:
```{r bin_sizes, fig.width=12, fig.height=4, out.width=624, fig.cap='The three little histo-bears.'}
par(mfrow = c(1, 3), cex = 1.2)
hist(trees$Height, col = "gray40", breaks = 3, xlab = "Height (ft)", main = "Too wide")
hist(trees$Height, col = "gray40", breaks = 6, xlab = "Height (ft)", main = "Just right")
hist(trees$Height, col = "gray40", breaks = 18, xlab = "Height (ft)", main = "Too narrow")
```
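When a single number is passed as `breaks`, base R's `hist()` treats it as a suggestion and picks "pretty" break points near it, so the number of bins actually drawn can differ from the number requested. A quick sketch to inspect this (the helper variable names are my own):

```r
# Requested numbers of bins, as in the three histograms above
requested <- c(3, 6, 18)

# Number of bins hist() actually produces for each request
actual <- sapply(requested, function(k) {
  length(hist(trees$Height, breaks = k, plot = FALSE)$counts)
})

data.frame(requested, actual)
```

Passing an explicit vector of break points instead of a single number is one way to get exactly the bins you want.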
For a deeper look at histograms, I encourage you to check out [Exploring Histograms](https://tinlizzie.org/histograms/) by Aran Lunzer and Amelia McNamara.
### Comparing Distributions
The following plots are used for comparing distributions of a continuous variable (such as sepal length of Iris flowers) between different groups (such as different species):
```{r comparing_distributions, fig.width=10, fig.height=5, out.width=624}
p1 <- ggplot(data = iris, aes(x = Sepal.Length, fill = Species)) +
geom_density(alpha = 0.5, adjust = 1.5) +
scale_fill_brewer(palette = "Set1") +
theme_minimal(14) +
labs(x = "Sepal length (cm)", title = "Density plot")
p2 <- ggplot(data = iris, aes(y = Sepal.Length, x = Species)) +
geom_violin(aes(fill = Species), adjust = 1.5) +
scale_fill_brewer(palette = "Set1") +
theme_minimal(14) +
labs(y = "Sepal length (cm)", title = "Violin plot")
cowplot::plot_grid(p1, p2, nrow = 1)
```
The [density plot](https://en.wikipedia.org/wiki/Kernel_density_estimation) on the left is like a smooth histogram that doesn't discretize the variable into bins. The [violin plot](https://en.wikipedia.org/wiki/Violin_plot) on the right is a rotated version that makes it easier to perform the comparison because the densities (distributions) are not overlapping.
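One way to see why a density curve behaves like a smooth histogram is to check that the area under it is approximately 1, just like the total area of a histogram's bars on the density scale. This is a rough sketch using base R's `density()` with its default settings and a simple trapezoid rule; it is not how the plots above were produced.

```r
# Kernel density estimate of sepal length (default bandwidth and kernel)
d <- density(iris$Sepal.Length)

# Trapezoid rule: average height of neighbouring points times their spacing
heights <- (head(d$y, -1) + tail(d$y, -1)) / 2
area <- sum(diff(d$x) * heights)

# Should be very close to 1
area
```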
An alternative called the *ridgeline plot* has recently gained a lot of popularity for comparing distributions across groups because of how compact it is, which is especially useful when comparing many groups.
```{r ridgeline, eval=FALSE}
library(ggridges) # formerly ggjoy
ggplot(iris, aes(x = Sepal.Length, y = Species, fill = Species)) +
geom_density_ridges() +
scale_x_continuous(expand = c(0.01, 0)) +
scale_y_discrete(expand = c(0.01, 0)) +
scale_fill_brewer(palette = "Set1") +
theme_minimal(14) +
labs(x = "Sepal length (cm)", title = "Ridgeline plot")
```
![Ridgeline plot of sepal length.](index_files/figure-html4/ridgeline.png)
A [box-and-whiskers chart](https://en.wikipedia.org/wiki/Box_plot) (also known as a *box plot*) allows you to visually compare the distributions by way of a [five number summary](https://en.wikipedia.org/wiki/Five-number_summary) which includes:
- Sample minimum (the smallest value)
- First [quartile](https://en.wikipedia.org/wiki/Quartile) (*Q<sub>1</sub>*) which is the 25th percentile
- Second quartile (*Q<sub>2</sub>*) also known as the [*median*](https://en.wikipedia.org/wiki/Median)
- Third quartile (*Q<sub>3</sub>*) which is the 75th percentile
- Sample maximum (the largest value)
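The five numbers listed above can be computed directly. A minimal sketch using base R's `quantile()` on the built-in `iris` data (the labels assigned to the result are my own):

```r
x <- iris$Sepal.Length

# Minimum, quartiles, and maximum in one call
five_numbers <- quantile(x, probs = c(0, 0.25, 0.5, 0.75, 1))
names(five_numbers) <- c("Minimum", "Q1", "Median (Q2)", "Q3", "Maximum")
five_numbers

# The box in a box plot spans Q1 to Q3; its height is the interquartile range
iqr <- five_numbers[["Q3"]] - five_numbers[["Q1"]]
iqr
```

Many box plot implementations draw whiskers only up to 1.5 times the interquartile range from the box and show more extreme values as separate points, so it is worth checking which convention a given chart uses.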
```{r boxplot, fig.height=7, fig.width=7, out.width=624}
five_number_summary <- iris %>%
group_by(1) %>%
summarize(
`Sample minimum` = min(Sepal.Length),
`1st quartile (25th percentile)` = quantile(Sepal.Length, 0.25),
`2nd quartile (median)` = median(Sepal.Length),
`3rd quartile (75th percentile)` = quantile(Sepal.Length, 0.75),
`Sample maximum` = max(Sepal.Length)
) %>%
tidyr::gather(Summary, Sepal.Length, -1) %>%
select(-1)
ggplot(data = iris, aes(y = Sepal.Length, x = 1)) +
geom_boxplot(color = "gray50") +
geom_point(data = five_number_summary, size = 9, shape = "←", position = position_nudge(x = 0.025)) +
geom_label(
data = five_number_summary,
aes(label = Summary, hjust = "left"),
nudge_x = 0.025, size = 5, label.padding = unit(0.5, "lines")
) +
theme_minimal(14) +
labs(y = "Sepal length (cm)", title = "Box plot", x = NULL) +
theme(
panel.grid.major.y = element_line(linetype = "dashed", color = "gray80"),
panel.grid.minor.y = element_blank(),
panel.grid.major.x = element_blank(),
panel.grid.minor.x = element_blank(),
axis.text.x = element_blank()
)
```
My personal preference is to combine a violin plot and a box plot, so you still see the distribution in case there are multiple peaks ([*modes*](https://en.wikipedia.org/wiki/Mode_(statistics))) -- something you can't see with just a box-and-whiskers plot -- but you also see the summaries:
```{r violin_box_combined, fig.cap='Notice how the box plot hides the three modes.'}
temp <- iris %>%
mutate(Species = "all 3") %>%
rbind(iris)
ggplot(data = temp, aes(y = Sepal.Length, x = Species)) +
geom_violin(fill = "gray80", color = NA, adjust = 0.5) +
geom_boxplot(width = 0.1) +
theme_minimal(14) +
labs(y = "Sepal length (cm)", title = "Violin and box")
```
### Multiple variables
[Scatter plots](https://en.wikipedia.org/wiki/Scatter_plot) are the most popular and simplest way to investigate relationships between quantitative variables. You have one variable on the X axis and one variable on the Y axis. Each point represents a single unit from your dataset (e.g. a subject of an experiment):
```{r scatterplot, fig.cap='Shape and color of the points are determined by the species. Shapes are often used together with color to make the graphic better for colorblindness and grayscale printing.'}
ggplot(data = iris, aes(x = Petal.Length, y = Sepal.Length)) +
geom_point(aes(color = Species, shape = Species)) +
scale_color_brewer(palette = "Set1") +
theme_minimal(14) +
labs(
x = "Petal length (cm)", y = "Sepal length (cm)",
title = "Scatter plot of relationship between petal & sepal lengths"
)
```
Data scientists and analysts often use *scatterplot matrices* to look at many different relationships between pairs of variables simultaneously:
```{r scatterplot_matrices, fig.height=10, fig.width=10, out.width=624, fig.cap='These are usually not present in final drafts of reports and are instead used as tools during the exploratory data analysis step.'}
panel.hist <- function(x, ...)
{
# Copied from ?pairs
usr <- par("usr"); on.exit(par(usr))
par(usr = c(usr[1:2], 0, 1.5) )
h <- hist(x, plot = FALSE, breaks = 20)
breaks <- h$breaks; nB <- length(breaks)
y <- h$counts; y <- y/max(y)
rect(breaks[-nB], 0, breaks[-1], y, col = "gray40", border = "black")
}
panel.ellipse <- function(x, y, ...) {
args <- list(...)
tmp <- split(data.frame(x = x, y = y), args$col)
for (color in names(tmp)) {
points(
tmp[[color]], pch = 16,
col = adjustcolor(color, alpha.f = 0.25)
)
mixtools::ellipse(
mu = colMeans(tmp[[color]]),
sigma = cov(tmp[[color]]),
alpha = 0.4, col = color,
newplot = FALSE, draw = TRUE
)
}
}
pairs(
iris[, 1:4],
col = RColorBrewer::brewer.pal(3, "Set1")[as.numeric(iris$Species)],
pch = 20, diag.panel = panel.hist, lower.panel = panel.ellipse,
main = "Scatterplot matrix of Iris flower measurements"
)
```
At first glance there is **a lot** going on in that particular matrix, but really there are three main components that we can focus on one at a time:
1. the four panels along the *diagonal* have histograms of the individual variables,
2. the six panels in the *upper triangle* above the diagonal have basic scatter plots with points colored according to species for each pair of variables, and
3. the six panels in the *lower triangle* below the diagonal are also scatterplots, but with ellipses tracing the two-dimensional densities (assuming [Normality](https://en.wikipedia.org/wiki/Multivariate_normal_distribution)).
[Line charts](https://en.wikipedia.org/wiki/Line_chart) are the most common way to visualize [time series](https://en.wikipedia.org/wiki/Time_series) data, with time **usually** as the horizontal X axis and range of a quantitative variable as the vertical Y axis:
```{r pageviews, cache=TRUE}
enwiki_pvs <- pageviews::project_pageviews(end = Sys.Date() - 1)
frwiki_pvs <- pageviews::project_pageviews(project = "fr.wikipedia", end = Sys.Date() - 1)
pvs <- rbind(enwiki_pvs, frwiki_pvs)
pvs$language <- factor(pvs$language, c("en", "fr"), c("English", "French"))
```
```{r tsplot, fig.cap='You may notice that the linear scale and the difference in magnitude make it difficult to notice patterns for French Wikipedia. Perhaps this chart can be improved later in the workshop?'}
ggplot(pvs, aes(x = date, y = views / 1e6, color = language)) +
geom_line() +
scale_color_brewer("Language", palette = "Set1") +
scale_x_datetime(date_breaks = "4 months", date_labels = "%b '%y") +
labs(
x = "Date", y = "Pageviews (millions)",
title = "French & English Wikipedia daily pageviews",
subtitle = "All platforms, all agent types"
) +
theme_minimal(14)
```
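The caption above hints at why log-transformed scales matter when series differ greatly in magnitude. The sketch below uses made-up pageview numbers (not real data) to show that the same relative change produces very different jumps on a linear scale but equal jumps on a log scale. In ggplot2, `scale_y_log10()` is one way to apply such a scale to a chart.

```r
# Hypothetical daily pageview counts (illustrative only, not real data)
big_wiki <- c(250e6, 275e6)    # a 10% increase on a large project
small_wiki <- c(25e6, 27.5e6)  # the same 10% increase on a smaller project

# On a linear scale the absolute jumps differ by a factor of 10
diff(big_wiki)
diff(small_wiki)

# On a log10 scale both jumps have the same height
diff(log10(big_wiki))
diff(log10(small_wiki))
```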
## Other visualizations
### Mosaic plots
[Mosaic plots](https://en.wikipedia.org/wiki/Mosaic_plot) are used to visualize the relationships between two or more qualitative variables, and they are incredibly rare. While they are very useful once you learn how to read them, that step can be very difficult and so it is unsurprising that they don't show up more. They're often used by statisticians during exploratory data analysis to perform a visual check before performing a statistical test of independence.
We will use these to examine the distribution of hair and eye colors in ~600 statistics students at the University of Delaware, as reported by Snee, R. D. in *The American Statistician* journal in 1974:
```{r mosaic1, fig.height=7, fig.width=9, out.width=624}
mosaicplot(
t(margin.table(HairEyeColor, c(1, 3))),
color = c("black", "brown", "red", "yellow"),
main = "Mosaic Plot of Men and Women's Hair Colors"
)
rect(0.25, 0.5, 0.75, 0.7, col = "white", border = "black", lwd = 8)
rect(0.25, 0.5, 0.75, 0.7, col = "white", border = "green", lwd = 4)
text(0.5, 0.6, "Black hair color was more prevalent\nin men than women in this dataset.", cex = 1.1)
rect(0.34, 0, 0.66, 0.10, col = "white", border = "black", lwd = 8)
rect(0.34, 0, 0.66, 0.10, col = "white", border = "green", lwd = 4)
text(0.5, 0.05, "Opposite for blond hair.", cex = 1.1)
```
We can extend a mosaic plot to include *standardized residuals* (also called [*studentized residuals*](https://en.wikipedia.org/wiki/Studentized_residual)) from a [log-linear model](https://en.wikipedia.org/wiki/Log-linear_model). Cells representing <span style='color:red;font-weight:bold;'>negative residuals</span> -- meaning there are <span style='color:red;font-weight:bold;'>fewer observations than would have been expected under independence</span> -- are drawn as <span style='color:red;font-weight:bold;'>red</span> with broken borders; <span style='color:blue;font-weight:bold;'>positive residuals</span> -- meaning <span style='color:blue;font-weight:bold;'>more observations than would be expected</span> -- are drawn in <span style='color:blue;font-weight:bold;'>blue</span> with solid borders.
```{r mosaic2, fig.height=7, fig.width=9, out.width=624}
mosaicplot(
margin.table(HairEyeColor, c(1, 2)), shade = TRUE,
main = "Shaded Mosaic Plot of Hair and Eye Colors"
)
text(0.35, 0.79, "← Way more black-haired\npeople with brown eyes\nthan expected given\noverall proportions\nof brown-eyed and\nblack-haired people.", cex = 1.1)
highlight <- data.frame(
x = c(0.00, 0.15, 0.15, 0.535, 0.535, 0.645, 0.645, 0.824, 0.824, 0.645, 0.645, 0.535, 0.535, 0.15, 0.15, 0.00, 0.00),
y = c(0.025, 0.025, 0.075, 0.075, 0.165, 0.165, 0.095, 0.095, 0.19, 0.19, 0.37, 0.37, 0.27, 0.27, 0.16, 0.16, 0.025)
)
lines(x = highlight$x, y = highlight$y, col = "black", lwd = 8)
lines(x = highlight$x, y = highlight$y, col = "green", lwd = 4)
text(0.35, 0.175, "Fewer blond-haired people\nwith hazel-colored eyes\nthan we’d expect.", cex = 1.1)
# axis(4, at = seq(0, 1, 0.05))
# axis(1, at = seq(0, 1, 0.05))
# xy <- expand.grid(x = seq(0, 1, 0.1), y = seq(0, 1, 0.1)); xy$l <- sprintf("(%.1f, %.1f)", xy$x, xy$y)
# z <- xy[identify(xy$x, xy$y, xy$l), ]
```
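The idea of "more or fewer than expected under independence" can be made concrete with a few lines of base R. This is a sketch, not the exact computation `mosaicplot()` performs internally: it takes expected counts from `chisq.test()`, which for a two-way table should coincide with those of the independence log-linear model, and computes Pearson residuals, one common form of such residuals.

```r
# Two-way table of hair color by eye color (summed over sex)
hair_eye <- margin.table(HairEyeColor, c(1, 2))

# Expected counts if hair and eye color were independent
independence <- chisq.test(hair_eye)
expected <- independence$expected

# Pearson residuals: (observed - expected) / sqrt(expected)
pearson <- (hair_eye - expected) / sqrt(expected)
round(pearson, 1)

# Positive: more black-haired, brown-eyed people than expected
pearson["Black", "Brown"]
# Negative: fewer blond-haired, hazel-eyed people than expected
pearson["Blond", "Hazel"]
```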
We can also look at the proportions across all three variables:
```{r mosaic3, fig.height=6, fig.width=8, out.width=624}
mosaicplot(
HairEyeColor,
main = "Mosaic Plot of Hair and Eye Colors in Women and Men"
)
highlight <- data.frame(
x = c(0.00, 0.19, 0.19, 0.65, 0.65, 0.78, 0.78, 1.00, 1.00, 0.78, 0.78, 0.65, 0.65, 0.19, 0.19, 0.00, 0.00),
y = c(0.19, 0.19, 0.28, 0.28, 0.38, 0.38, 0.20, 0.20, 0.92, 0.92, 0.63, 0.63, 0.585, 0.585, 0.40, 0.40, 0.19)
)
lines(x = highlight$x, y = highlight$y, col = "black", lwd = 8)
lines(x = highlight$x, y = highlight$y, col = "green", lwd = 4)
```
What the third mosaic plot tells us:
- Blond was the most prevalent hair color among those with blue eyes.
- More brown-haired men had blue eyes than brown-haired women.
- More blond-haired women had blue eyes than blond-haired men.
### Stacked area plots
A *stacked area plot* is a way to visualize changes in amounts (or proportions) over time.
```{r stacked_area, fig.height=6, fig.width=12, out.width=624, fig.cap='Beginning with 1925, the number of people over the age of 64 has increased dramatically, especially after 1975.'}
library(gcookbook) # install.packages("gcookbook")
p1 <- ggplot(uspopage, aes(y = Thousands, x = Year, fill = AgeGroup)) +
geom_area(color = "black") +
scale_y_continuous("Number of people in thousands", labels = compress) +
scale_fill_discrete("Age group", breaks = rev(levels(uspopage$AgeGroup))) +
ggtitle("Stacked areas", "Age distribution in the United States, 1900-2002") +
theme_minimal(14) +
guides(fill = guide_legend(reverse = TRUE))
p2 <- ggplot(uspopage, aes(y = Thousands, x = Year, fill = AgeGroup)) +
geom_area(color = "black", position = "fill") +
scale_y_continuous("Proportion", labels = scales::percent_format()) +
scale_fill_discrete("Age group", breaks = rev(levels(uspopage$AgeGroup))) +
ggtitle("Stacked proportions", "Age distribution in the United States, 1900-2002") +
theme_minimal(14) +
guides(fill = guide_legend(reverse = TRUE))
cowplot::plot_grid(p1, p2, nrow = 1)
```
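The difference between the two panels is whether each year's values are stacked as raw amounts or first rescaled to shares of that year's total. A minimal sketch of the rescaling step with base R's `ave()`, using a small made-up table (the numbers are hypothetical and are not taken from the dataset plotted above):

```r
# Hypothetical population counts in thousands (illustrative only)
pop <- data.frame(
  Year = rep(c(1900, 2000), each = 3),
  AgeGroup = rep(c("<15", "15-64", ">64"), times = 2),
  Thousands = c(26000, 47000, 3000, 60000, 186000, 35000)
)

# Share of each age group within its own year
pop$Proportion <- ave(pop$Thousands, pop$Year, FUN = function(x) x / sum(x))

pop
```

Stacked amounts show growth in the total, while stacked proportions make changes in composition easier to see, at the cost of hiding the total.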
### Heat maps
[Heatmaps](https://en.wikipedia.org/wiki/Heat_map) are a graphical representation of matrices. For example, we can visualize a dataset of top 50 NBA players' performance statistics from [the 2008-09 season](https://en.wikipedia.org/wiki/2008%E2%80%9309_NBA_season) (obtained from [RotoWire](https://www.rotowire.com/), formerly databaseBasketball):
```{r nba_data, cache=TRUE}
# https://learnr.wordpress.com/2010/01/26/ggplot2-quick-heatmap-plotting/
nba <- read.csv("http://datasets.flowingdata.com/ppg2008.csv")
nba$Name <- with(nba, reorder(Name, PTS))
colnames(nba) <- c(
"Name", "Games", "Minutes", "Points",
"Field goals made", "Field goal attempts", "Field goal %",
"Free throws made", "Free throw attempts", "Free throw %",
"Three-pointers made", "Three-point attempts", "Three-point %",
"Offensive rebounds", "Defensive rebounds", "Total rebounds",
"Assists", "Steals", "Blocks", "Turnovers", "PF"
)
nba$PF <- NULL; nba.m <- reshape2::melt(nba)
nba.m <- plyr::ddply(nba.m, plyr::.(variable), transform, rescale = scale(value))
nba.m$Name <- factor(nba.m$Name, levels(nba.m$Name)[order(levels(nba.m$Name))])
```
```{r nba_table, dependson='nba_data', eval=FALSE}
DT::datatable(
nba,
options = list(order = list(1, "desc")),
class = "cell-border stripe",
rownames = FALSE, filter = "top"
) %>%
DT::formatPercentage(c("Field goal %", "Free throw %", "Three-point %"))
```
```{r heatmap, fig.height=10, fig.width=8, out.width=624, dependson='nba_data'}
ggplot(nba.m, aes(x = variable, y = Name, fill = rescale)) +
geom_tile(color = "white") +
viridis::scale_fill_viridis(
"Compared to other 49 players",
breaks = c(-2, 0, 2, 4),
labels = function(x) {
return(factor(x, c(-2, 0, 2, 4), c("Worse", "Average", "Better", "Way better")))
}
) +
theme_minimal(14) +
labs(
x = NULL, y = NULL,
title = "Heatmap of top 50 NBA scorers' performance",
subtitle = "Performance data from 2008-2009 season; centered and scaled",
caption = "Source: FlowingData and RotoWire (formerly databaseBasketball)"
) +
scale_x_discrete(expand = c(0, 1)) +
scale_y_discrete(expand = c(0, 0), limits = rev(levels(nba.m$Name))) +
theme(
legend.position = "bottom", # "none",
legend.key.width = unit(3,"line"),
axis.ticks = element_blank(),
axis.text.y = element_text(color = "black"),
axis.text.x = element_text(size = 14 * 0.8, angle = 330, hjust = 0, color = "black"),
plot.caption = element_text(size = 8)
)
```
Some observations:
- [Dwight Howard](https://en.wikipedia.org/wiki/Dwight_Howard) was the best at blocking shots
- Dwight Howard was also one of the worst at making free throws
- [Yao Ming](https://en.wikipedia.org/wiki/Yao_Ming) was the best at making three-pointers (by % successful out of total attempts)
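The heatmap's subtitle says the statistics were "centered and scaled", which is what lets very different statistics (points, percentages, rebounds) share one color scale. A minimal sketch of that transformation with base R's `scale()`, using made-up numbers rather than the NBA data:

```r
# Hypothetical points-per-game values for four players (illustrative only)
points <- c(10, 20, 30, 40)

# Subtract the mean, then divide by the standard deviation
z <- as.vector(scale(points))
z

# After the transformation the values have mean 0 and standard deviation 1
mean(z)
sd(z)
```

A value of 0 then means "average among these players" and a value of 2 means "two standard deviations above average", regardless of the statistic's original units.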
```{r corr_map, fig.width=8, fig.height=6, out.width=624}
nba_correlations <- cor(nba[, -1])
abbreviations <- abbreviate(row.names(nba_correlations), minlength = 6)
rownames(nba_correlations) <- colnames(nba_correlations) <- abbreviations
correlation_plot <- GGally::ggcorr(data = NULL, cor_matrix = nba_correlations, nbreaks = 4, palette = "RdGy", label = TRUE, label_size = 3, label_color = "white", hjust = "right", size = 5, color = "grey50", layout.exp = 2) +
ggtitle("Heatmap of correlations", "between performance statistics")
correlation_plot + theme_minimal(14)
```
Some observations:
- **Negative correlations** (in <span style='font-weight:bold;color:#CA0020;'>red</span>):
    - Players who made a higher percentage of field goals ("Fldgl.") stayed away from trying to make three-point shots ("Thr-pa" is "Three-point attempts").
- **Positive correlations** (in <span style='font-weight:bold;color:#404040;'>grey</span>):
    - Players who attempted/made more field goals ("Fldgla"/"Fldglm") also scored more points.
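The matrix behind a correlation heatmap is straightforward to compute. As a sketch, here is the same kind of matrix for the four numeric columns of the built-in `iris` data (used instead of the NBA data so the example runs without downloading anything):

```r
# Numeric measurements only
measurements <- iris[, 1:4]

# Pairwise Pearson correlations between all columns
correlations <- cor(measurements)
round(correlations, 2)

# A strongly positive pair: longer sepals tend to come with longer petals
correlations["Sepal.Length", "Petal.Length"]

# A negative pair: wider sepals tend to come with shorter petals
correlations["Sepal.Width", "Petal.Length"]
```

The matrix is symmetric with 1s on the diagonal, which is why correlation heatmaps often show only one triangle.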
### Tree maps
[Treemapping](https://en.wikipedia.org/wiki/Treemapping) is a way to visualize hierarchical (nested) data as rectangles within other rectangles, with the area of the rectangle representing the proportion and sometimes a shade or color representing another variable. It is not dissimilar to a mosaic plot!
```{r treemap, fig.cap='Almost all of the crew was male and almost 80% of them died. Most of the 3rd class passengers did not make it either, while more than 85% of women in 1st and 2nd classes survived.'}
library(treemapify)
titanic %>%
dplyr::mutate(Type = dplyr::case_when(
Sex == "Female" & Age == "Child" ~ "Girls",
Sex == "Male" & Age == "Child" ~ "Boys",
Sex == "Female" & Age == "Adult" ~ "Women",
Sex == "Male" & Age == "Adult" ~ "Men"
)) %>%
dplyr::group_by(Class, Type, Survived) %>%
dplyr::summarize(Freq = sum(Freq)) %>%
dplyr::summarize(Total = sum(Freq), Survival = Freq[Survived == "Yes"] / Total) %>%
dplyr::ungroup() %>%
dplyr::mutate(Label = sprintf("%s (%.1f%%)", Type, 100 * Survival)) %>%
ggplot(aes(area = Total, fill = Survival, label = Label, subgroup = Class)) +
geom_treemap() +
geom_treemap_subgroup_border(color = "white") +
geom_treemap_subgroup_text(
place = "centre", grow = TRUE, alpha = 0.5,
color = "black", fontface = "italic", min.size = 0
) +
geom_treemap_text(colour = "white", place = "topleft", reflow = TRUE) +
scale_fill_continuous(labels = scales::percent_format(), breaks = seq(0, 1, 0.25)) +
labs(title = "Treemap of Titanic passengers' survival rates") +
theme_minimal(14)
```
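The two quantities the treemap encodes can be checked by hand. The sketch below is a simplified version of the calculation at the coarser class level, using the built-in `Titanic` table rather than the `titanic` data frame from the chunk above; the names `by_class`, `class_totals`, `class_share`, and `class_survival` are introduced only for this illustration.

```{r sketch_titanic_rates}
# Titanic is a 4-way table: Class x Sex x Age x Survived.
# Collapse it to Class x Survived by summing over Sex and Age.
by_class <- apply(Titanic, c(1, 4), sum)

# The number of people in each class determines the AREA of its block.
class_totals <- rowSums(by_class)
class_share <- class_totals / sum(class_totals)

# The share of each class that survived is the kind of value mapped to FILL.
class_survival <- by_class[, "Yes"] / class_totals

round(cbind(share = class_share, survival = class_survival), 3)
```

The treemap itself goes one level deeper (men, women, boys, and girls within each class), but the arithmetic per rectangle is the same.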
### Choropleths
[Choropleths](https://en.wikipedia.org/wiki/Choropleth_map) are geographical maps that are colored and/or shaded according to some variable such as population density.
```{r choropleth}
data("USArrests", package = "datasets")
data("fifty_states", package = "fiftystater")
library(fiftystater); library(mapproj)
crimes <- data.frame(state = tolower(rownames(USArrests)), USArrests) %>%
dplyr::left_join(data.frame(abb = state.abb, state = tolower(state.name)), by = "state")
centroids <- fifty_states[, c("long", "lat")] %>%
split(fifty_states$id) %>%
lapply(as.matrix) %>%
lapply(geosphere::centroid) %>%
lapply(as.data.frame) %>%
dplyr::bind_rows(.id = "state")
abbrvs <- crimes[, c("state", "abb")] %>%
dplyr::left_join(centroids, by = "state")
ggplot(crimes, aes(map_id = state)) +
geom_map(aes(fill = Murder), map = fifty_states, color = "white") +
geom_text(data = abbrvs, aes(label = abb, x = lon, y = lat)) +
expand_limits(x = fifty_states$long, y = fifty_states$lat) +
coord_map() +
scale_fill_distiller(palette = "RdGy") +
scale_x_continuous(breaks = NULL) +
scale_y_continuous(breaks = NULL) +
labs(
x = NULL, y = NULL, fill = "Murder arrests",
title = "Choropleth of 1973 crime rates by US state",
subtitle = "Arrest rates are per 100,000 residents",
caption = "Source: World Almanac and Book of Facts, 1975"
) +
theme_minimal(14) +
theme(
panel.background = element_blank()
)
```
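The map above uses a continuous color gradient. A common alternative is a classed choropleth, where the variable is first cut into a few bins and each bin gets one color. The following is a minimal sketch of that alternative (not the approach used in the chunk above), assuming quartile bins; `murder_rate`, `class_breaks`, and `murder_class` are names made up for the example.

```{r sketch_choropleth_classes}
# Murder arrests per 100,000 residents, named by state (built-in data).
murder_rate <- setNames(USArrests$Murder, rownames(USArrests))

# Quartile break points: each class holds roughly a quarter of the states.
class_breaks <- quantile(murder_rate, probs = seq(0, 1, 0.25))
murder_class <- cut(murder_rate, breaks = class_breaks, include.lowest = TRUE)

# Number of states per color class.
table(murder_class)

# The states that would get the darkest shade (the top class).
names(murder_rate)[murder_class == levels(murder_class)[4]]
```

When reading a classed choropleth, check the legend for how the bins were chosen, because different break points can make the same data look quite different.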
### Networks and graphs
[Network diagrams](https://en.wikipedia.org/wiki/Graph_drawing) are for visualizing [graphs](https://en.wikipedia.org/wiki/Graph_(discrete_mathematics)) (from [graph theory](https://en.wikipedia.org/wiki/Graph_theory)) and networks (from [network theory](https://en.wikipedia.org/wiki/Network_theory)) where there are [*nodes*](https://en.wikipedia.org/wiki/Node_(computer_science)) ([*vertices*](https://en.wikipedia.org/wiki/Vertex_(graph_theory))) connected by *links* ([*edges*](https://en.wikipedia.org/wiki/Edge_(graph_theory))). Their goal is to visually represent relationships between units. For example, using [the Wikipedia Clickstream data](https://meta.wikimedia.org/wiki/Research:Wikipedia_clickstream) from November 2017 we can start at the article on net neutrality and visualize a *neighborhood* of articles that are *adjacent* to the central one:
<a title="By MPopov (WMF) (Own work) [CC BY-SA 4.0 (https://creativecommons.org/licenses/by-sa/4.0)], via Wikimedia Commons" href="https://commons.wikimedia.org/wiki/File%3ANet_neutrality_clickstream_(Nov_2017).png"><img width="512" alt="Net neutrality clickstream (Nov 2017)" src="https://upload.wikimedia.org/wikipedia/commons/thumb/7/73/Net_neutrality_clickstream_%28Nov_2017%29.png/512px-Net_neutrality_clickstream_%28Nov_2017%29.png"/></a>
The darkness of the edges connecting the vertices represents how many clicks there were between the pairs of articles. We can see that there are more clicks between "net neutrality" and "digital rights" than between "net neutrality" and "human rights", and far more still between "net neutrality" and "Wikipedia Zero".
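The vocabulary above (vertices, edges, neighborhood, edge weight) can be sketched with a plain data frame. The click counts below are hypothetical numbers invented for illustration, chosen only to mirror the ordering described above; they are not values from the Clickstream data.

```{r sketch_network_edges}
# HYPOTHETICAL click counts, invented for illustration only;
# these are not values from the Wikipedia Clickstream data set.
clicks <- data.frame(
  from = c("Net neutrality", "Net neutrality", "Net neutrality", "Digital rights"),
  to = c("Digital rights", "Human rights", "Wikipedia Zero", "Human rights"),
  n = c(40, 15, 120, 10),
  stringsAsFactors = FALSE
)

# Neighborhood of the central article: every vertex that shares an edge with it.
center <- "Net neutrality"
touches_center <- clicks$from == center | clicks$to == center
neighbors <- setdiff(
  unique(c(clicks$from[touches_center], clicks$to[touches_center])),
  center
)
neighbors

# Edge weight rescaled to (0, 1]; a drawing routine could map this to darkness.
clicks$darkness <- clicks$n / max(clicks$n)
clicks[order(-clicks$n), ]
```

Each row is one edge, so the number of rows touching an article is its number of connections, and the `n` column plays the role of the edge darkness in the diagram.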
## Scales and transformed data
Sometimes the author of the visualization has chosen to apply a [transformation to the data](https://en.wikipedia.org/wiki/Data_transformation_(statistics)) because the data is [skewed](https://en.wikipedia.org/wiki/Skewness). Watch out for such transformations when reading a chart, especially [logarithmic scales](https://en.wikipedia.org/wiki/Logarithmic_scale), where equal distances along the axis represent equal ratios rather than equal differences.
```{r transformed_scatterplot, fig.width=12, fig.height=12, out.width=624, out.height=624}
set.seed(0)
x <- sample(1:100, 500, replace = TRUE)
slope <- 0.01; intercept <- runif(1, 0, 1); err.std <- runif(1, 0.01, 0.1)
noise <- rnorm(length(x), 0, sd = err.std)
y <- intercept + slope * x + noise
df <- data.frame(x = x, y = 10 ^ y)
p1 <- ggplot(df, aes(x = y)) + geom_histogram(aes(y = ..density..), fill = "gray70") + geom_density(adjust = 2) + theme_minimal(14) + ggtitle("Distribution of y is right skewed")
p2 <- ggplot(df, aes(x = log10(y))) + geom_histogram(aes(y = ..density..), fill = "gray70") + geom_density(adjust = 2) + theme_minimal(14) + ggtitle("Transformed y is more Normally distributed")
p3 <- ggplot(df, aes(x = x, y = y)) + geom_point() + ggtitle("Not a linear relationship between x and y") + theme_minimal(14) + geom_smooth(se = FALSE, method = "lm", size = 2)
p4 <- ggplot(df, aes(x = x, y = log10(y))) + geom_point() + ggtitle("Linear relationship between x and transformed y") + theme_minimal(14) + geom_smooth(se = FALSE, method = "lm", size = 2)
cowplot::plot_grid(plotlist = list(p1, p2, p3, p4))
```
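The same point can be checked numerically. The sketch below simulates data of the same kind as the chunk above (the names `sim_x` and `sim_y`, the seed, and the parameter values are assumptions for illustration) and compares a straight-line fit before and after the log transformation.

```{r sketch_log_linear_fit}
set.seed(42)
# Simulate data where log10(y) is linear in x plus Normal noise.
# The intercept (0.5), slope (0.01), and noise level (0.05) are illustrative.
sim_x <- sample(1:100, 500, replace = TRUE)
sim_y <- 10 ^ (0.5 + 0.01 * sim_x + rnorm(length(sim_x), 0, 0.05))

# Fit a straight line to the raw and to the transformed response.
fit_raw <- lm(sim_y ~ sim_x)
fit_log <- lm(log10(sim_y) ~ sim_x)

# Proportion of variance explained by each straight-line fit.
c(raw = summary(fit_raw)$r.squared, log10 = summary(fit_log)$r.squared)

# The slope on the log10 scale should be close to the simulated value of 0.01.
coef(fit_log)
```

A slope of 0.01 on the log10 scale means that each one-unit increase in x multiplies y by about 10^0.01, or roughly 2.3%, which is why such relationships curve upward on a linear axis.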
Let us revisit the pageviews data from earlier by utilizing a logarithmic axis:
```{r tsplot_log10, fig.cap='Notice how the French Wikipedia pageviews are no longer dampened by English Wikipedia pageviews\' magnitude.'}
ggplot(pvs, aes(x = date, y = views / 1e6, color = language)) +
geom_line() +
scale_y_log10("Pageviews (millions)") +
scale_color_brewer("Language", palette = "Set1") +
scale_x_datetime(date_breaks = "4 months", date_labels = "%b '%y") +
labs(
x = "Date", title = "French & English Wikipedia daily pageviews",
subtitle = "All platforms, all agent types"
) +
theme_minimal(14)
```
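Why does the smaller series become readable on a log axis? A short sketch with hypothetical pageview counts (round numbers invented for the example, not taken from `pvs`) shows that a log axis displays relative change rather than absolute change.

```{r sketch_log_axis_ratios}
# HYPOTHETICAL daily pageviews (illustrative numbers, not the pvs data):
# both series drop by the same 20% from one day to the next.
big_series <- c(250e6, 200e6)
small_series <- c(25e6, 20e6)

# On a linear axis the vertical distance is the absolute difference,
# so the larger series moves ten times as far.
diff(big_series)
diff(small_series)

# On a log10 axis the vertical distance is the difference of logs,
# which depends only on the ratio, so both drops look the same size.
diff(log10(big_series))
diff(log10(small_series))
```

The trade-off is that absolute gaps are harder to judge on a log axis, so read the tick labels rather than the distances when comparing magnitudes.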
It is possible (but rare) to encounter logarithmically scaled time axes, which are helpful when you have long tails caused by outliers:
```{r transformed_histogram, fig.width=12, fig.height=6, out.width=624}
# Breaks (in seconds) from 1 second up to 1 day
logtime_breaks <- c(1, 5, 30, 60, 60*5, 60*10, 60*30, 60*60, 60*60*24)
# Format seconds as compact labels such as "5s", "1m", "1h"
logtime_labels <- function(breaks) {
  lbls <- breaks %>%
    round %>%
    lubridate::seconds_to_period() %>%
    tolower %>%
    gsub(" ", "", .) %>%
    sub("(.*[a-z])0s$", "\\1", .) %>%
    sub("(.*[a-z])0m$", "\\1", .) %>%
    sub("(.*[a-z])0h$", "\\1", .)
  return(lbls)
}
# Log10 x-axis that uses the time breaks and labels defined above
scale_x_logtime <- function(...) {
  scale_x_log10(..., breaks = logtime_breaks, labels = logtime_labels)
}
set.seed(0)
sessions <- data.frame(length = 10 ^ abs(rnorm(100, 0, 2)))
p1 <- ggplot(sessions, aes(x = length)) +
geom_histogram() +
scale_x_continuous(
name = "Session length",
breaks = 3600 * seq(0, 24, 4),
labels = logtime_labels
) +
ggtitle("Histogram of session length", "Not very useful") +
theme_minimal(14)
p2 <- ggplot(sessions, aes(x = length)) +
geom_histogram() +
scale_x_logtime(name = "Session length") +
ggtitle("Logarithmically scaled time axis", "Substantially more useful") +
theme_minimal(14)
cowplot::plot_grid(p1, p2)
```
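To see why the linear axis is "not very useful" here, it helps to count how many sessions fall between consecutive breaks. This is a rough sketch using the same kind of simulated session lengths as above; `demo_lengths`, `demo_breaks`, and `bin_counts` are names introduced for the example.

```{r sketch_logtime_bins}
set.seed(0)
# Simulated session lengths in seconds, generated the same way as above.
demo_lengths <- 10 ^ abs(rnorm(100, 0, 2))
# The log-time breaks, plus Inf to catch sessions longer than one day.
demo_breaks <- c(1, 5, 30, 60, 60 * 5, 60 * 10, 60 * 30, 60 * 60, 60 * 60 * 24, Inf)

# Count the sessions between consecutive axis breaks.
bin_counts <- table(
  cut(demo_lengths, breaks = demo_breaks, include.lowest = TRUE, dig.lab = 6)
)
bin_counts

# Share of sessions shorter than one hour, and the longest session.
# On a linear axis that runs up to the longest session, everything under
# one hour is squeezed into a narrow band at the left edge.
mean(demo_lengths < 3600)
max(demo_lengths)
```

The log-scaled axis gives each of these bins comparable room, which is what lets the shape of the short-session part of the distribution show up in the second histogram.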
## Group activity
Pair up with someone sitting next to you and pick one of the following 3 visualizations. You and your partner(s) should agree on the same one.
1. **This part is done individually** (3 minutes)
    - Note 2-3 interesting observations.
    - **Reminder:**
        - Once you've identified the variables involved, you are looking for relationships between them.
        - You're also looking for patterns and outliers.
2. **This part is done with your partner(s)** (2-3 minutes)
    - Share your insights with your partner(s).
    - Check if they agree with your observations.
    - If they didn't notice the same things as you, explain how you arrived at your interpretation of the chart.
A different take on the Titanic data:
```{r plot1, fig.width=8, fig.height=12, out.width=624}
par(mfrow = c(2, 1))
mosaicplot(~ Sex + Age + Survived, data = Titanic, shade = TRUE, main = "Plot 1a: Titanic passenger survivorship", cex.axis = 0.8)
mosaicplot(~ Class + Sex + Survived, data = Titanic, shade = TRUE, main = "Plot 1b: Titanic passenger survivorship", cex.axis = 0.8)
par(mfrow = c(1, 1))
```
A different take on the violent crime rates data:
```{r plot2, fig.width=10, fig.height=5, out.width=624}
waffle::iron(
waffle::waffle(
USArrests["California", c("Murder", "Assault", "Rape")],
xlab = "1 square = 1 arrest per 100,000 residents",
title = "Plot 2a: Violent crimes in California in 1973",
rows = 10
),
waffle::waffle(
USArrests["Pennsylvania", c("Murder", "Assault", "Rape")],
xlab = "1 square = 1 arrest per 100,000 residents",
title = "Plot 2b: Violent crimes in Pennsylvania in 1973",
rows = 4
)
)
```
A different take on the Wikipedia pageviews data:
```{r plot3, dependson='pageviews', fig.width=12, fig.height=8, out.width=624}
seasons <- data.frame(
month = c(3, 6, 9, 12),
day = c(21, 21, 23, 21),
name = c("Spring Equinox", "Summer Solstice", "Autumnal Equinox", "Winter Solstice"),
starts = c("Spring", "Summer", "Autumn", "Winter"),
stringsAsFactors = FALSE
)
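to_season <- function(d) {
  d.year <- lubridate::year(d)
  # Dates on which each season starts in the year of `d`
  markers <- as.Date(paste(seasons$month, seasons$day, d.year, sep = "-"), format = "%m-%d-%Y")
  lgcls <- markers <= d
  if (all(!lgcls)) {
    # Before the spring equinox it is still winter
    return("Winter")
  } else {
    # Otherwise use the most recent season start
    return(seasons$starts[max(which(lgcls))])
  }
}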
pvs$season <- purrr::map_chr(pvs$date, to_season)
pvs %>%
dplyr::mutate(wday = lubridate::wday(date, label = TRUE, abbr = TRUE)) %>%
dplyr::filter(language == "English" | views < 40e6) %>%
ggplot(aes(x = wday, y = views)) +
geom_violin(fill = "gray80", color = NA, adjust = 0.75) +
geom_boxplot(width = 1) +
scale_y_continuous(labels = compress) +
facet_grid(language ~ season, scales = "free_y") +
labs(
x = "Day of week", y = "Pageviews",
title = "Plot 3: English and French Wikipedia pageviews",
subtitle = "By day of week and season"
) +
theme_bw(15)
```
## Assessment
Some questions to verify that you understand the core concepts in data visualization:
```{r quiz}
quiz(
question(
"Which of these are examples of a discrete quantitative variable?",
answer("Time spent reading a Wikipedia article"),
answer("Number of articles edited by user", correct = TRUE),
answer("User session ID"),
answer("Percent change in monthly pageviews from previous month"),
answer("Total pageviews last month", correct = TRUE),
random_answer_order = TRUE, type = "multiple",
incorrect = "Time spent and % change are continuous and user session ID is a qualitative variable."
),
question(
"A log transformation of the data can help when a variable has positive skew (a long tail on the right)",
answer("True", correct = TRUE),
answer("False")
),
question(
"Which basic chart is the best type for showing a relationship between two continuous quantitative variables?",
answer("pie"),
answer("bar"),
answer("scatter", correct = TRUE),
answer("histogram"),
answer("box and whiskers"),
answer("mosaic"),
random_answer_order = TRUE
),
question(
"An effective data visualization will include some or all of the following:",
answer("Distribution of a variable", correct = TRUE),
answer("Relationship(s) between variables", correct = TRUE),
answer("Labels", correct = TRUE),
answer("Fancy fonts"),
answer("Aesthetically pleasing colors"),
answer("Interactivity"),
random_answer_order = TRUE, type = "multiple",
incorrect = "Interactivity CAN make a data visualization more effective, but we have seen static (non-interactive) examples so far that are completely fine without it. When it comes to colors, the most important factor is whether they convey the information well, and having a color scheme that is aesthetically pleasing is really nice, but is not technically necessary. Fancy fonts can look good, but they don't add to the story the visualization is meant to tell."
),
question(
"A heatmap is a choropleth",
answer("True"),
answer("False", correct = TRUE),
incorrect = "Choropleth maps include geographical boundaries, which heatmaps do not."
),
question(
"Which of the following are examples of a qualitative/categorical variable? (This includes nominal and ordinal types.)",
answer("Survey respondent's age group \"65 and older\"", correct = TRUE),
answer("Survey respondent's age"),
answer("Rating scale (e.g. \"worst\" to \"best\")", correct = TRUE),
answer("Survey respondent's gender", correct = TRUE),
answer("How much time user spent responding to survey"),
random_answer_order = TRUE, type = "multiple",
incorrect = "Age is quantitative while an age group is qualitative. Time spent is quantitative."
)
)
```
## Appendix
### Visual essays
- [A visual introduction to machine learning](http://www.r2d3.us/visual-intro-to-machine-learning-part-1/) by Stephanie Yee and Tony Chu
- [Exploring Histograms](https://tinlizzie.org/histograms/) by Aran Lunzer and Amelia McNamara
- [Algorithms Tour](http://algorithms-tour.stitchfix.com/): How data science is woven into the fabric of StitchFix
- [An Interactive Visualization of Every Line in Hamilton](https://pudding.cool/2017/03/hamilton/index.html) by Shirley Wu
- [Constructed Career Paths from Job Switching Data](http://flowingdata.com/2017/11/28/career-paths/) by Nathan Yau
### Collections
- [The New York Times Graphics Department](https://twitter.com/nytgraphics)
- [Information is Beautiful Awards](https://www.informationisbeautifulawards.com/)
- [FlowingData](http://flowingdata.com/)'s [10 Best Data Visualization Projects of 2017](https://flowingdata.com/2017/12/28/10-best-data-visualization-projects-of-2017/)
### Further reading
- [The Visual Display of Quantitative Information](https://www.edwardtufte.com/tufte/books_vdqi) by Edward Tufte
- [Handbook of Data Visualization](http://www.springer.com/us/book/9783540330363) (Editors: Chun-houh Chen, Wolfgang Karl Härdle, Antony Unwin)
### Making your own
- [Visualize This: The FlowingData Guide to Design, Visualization, and Statistics](http://book.flowingdata.com/) by Nathan Yau
- [R Graphics Cookbook](http://shop.oreilly.com/product/0636920063704.do) by Winston Chang
- [ggplot2: Elegant Graphics for Data Analysis](http://ggplot2.org/book/) by Hadley Wickham
- [Data Visualization with Python and JavaScript](http://shop.oreilly.com/product/0636920037057.do) by Kyran Dale
- [SVG Animations: From Common UX Implementations to Complex Responsive Animation](http://shop.oreilly.com/product/0636920045335.do) by Sarah Drasner
- [D3.js in Action](https://www.manning.com/books/d3js-in-action-second-edition) by Elijah Meeks