This repository has been archived by the owner on Aug 27, 2024. It is now read-only.
-
Notifications
You must be signed in to change notification settings - Fork 15
/
04-eda.Rmd
2435 lines (1685 loc) · 123 KB
/
04-eda.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
# Exploratory data analysis and unsupervised learning {#eda}
Exploratory data analysis (EDA) is a process in which we summarise and visually explore a dataset. An important part of EDA is unsupervised learning, which is a collection of methods for finding interesting subgroups and patterns in our data. Unlike statistical hypothesis testing, which is used to reject hypotheses, EDA can be used to _generate_ hypotheses (which can then be confirmed or rejected by new studies). Another purpose of EDA is to find outliers and incorrect observations, which can lead to a cleaner and more useful dataset. In EDA we ask questions about our data and then try to answer them using summary statistics and graphics. Some questions will prove to be important, and some will not. The key to finding important questions is to ask a lot of questions. This chapter will provide you with a wide range of tools for question-asking.
After working with the material in this chapter, you will be able to use R to:
* Create reports using R Markdown,
* Customise the look of your plots,
* Visualise the distribution of a variable,
* Create interactive plots,
* Detect and label outliers,
* Investigate patterns in missing data,
* Visualise trends,
* Plot time series data,
* Visualise multiple variables at once using scatterplot matrices, correlograms and bubble plots,
* Visualise multivariate data using principal component analysis,
* Use unsupervised learning techniques for clustering,
* Use factor analysis to find latent variables in your data.
## Reports with R Markdown {#rmarkdown}
R Markdown files can be used to create nicely formatted documents using R, that are easy to export to other formats, like HTML, Word or pdf. They allow you to mix R code with results and text. They can be used to create reproducible reports that are easy to update with new data, because they include the code for making tables and figures. Additionally, they can be used as notebooks for keeping track of your work and your thoughts as you carry out an analysis. You can even use them for writing books - in fact, this entire book was written using R Markdown.
It is often a good idea to use R Markdown for exploratory analyses, as it allows you to write down your thoughts and comments as the analysis progresses, as well as to save the results of the exploratory journey. For that reason, we'll start this chapter by looking at some examples of what you can do using R Markdown. According to your preference, you can use either R Markdown or ordinary R scripts for the analyses in the remainder of the chapter. The R code used is the same and the results are identical, but if you use R Markdown, you can also save the output of the analysis in a nicely formatted document.
### A first example
When you create a new R Markdown document in RStudio by clicking _File > New File > R Markdown_ in the menu, a document similar to that below is created :
```{r echo = FALSE, comment = ""}
cat(htmltools::includeText("Examples/rmarkdown.Rmd"))
```
Press the _Knit_ button at the top of the Script panel to create an HTML document using this Markdown file. It will be saved in the same folder as your Markdown file. Once the HTML document has been created, it will open so that you can see the results. You may have to install additional packages for this to work, in which case RStudio will prompt you.
Now, let's have a look at what the different parts of the Markdown document do. The first part is called the _document header_ or _YAML header_. It contains information about the document, including its title, the name of its author, and the date on which it was first created:
```{r echo = FALSE, comment = ""}
cat(htmltools::includeText("Examples/rmarkdown1.Rmd"))
```
The part that says `output: html_document` specifies what type of document should be created when you press _Knit_. In this case, it's set to `html_document`, meaning that an HTML document will be created. By changing this to `output: word_document` you can create a `.docx` Word document instead. By changing it to `output: pdf_document`, you can create a `.pdf` document using LaTeX (you'll have to install LaTeX if you haven't already - RStudio will notify you if that is the case).
The second part sets the default behaviour of _code chunks_ included in the document, specifying that the output from running the chunks should be printed unless otherwise specified:
```{r echo = FALSE, comment = ""}
cat(htmltools::includeText("Examples/rmarkdown2.Rmd"))
```
The third part contains the first proper section of the document. First, a header is created using `##`. Then there is some text with formatting: `< >` is used to create a link and `**` is used to get **bold text**. Finally, there is a code chunk, delimited by ```` ``` ````:
```{r echo = FALSE, comment = ""}
cat(htmltools::includeText("Examples/rmarkdown3.Rmd"))
```
The fourth and final part contains another section, this time with a figure created using R. A setting is added to the code chunk used to create the figure, which prevents the underlying code from being printed in the document:
```{r echo = FALSE, comment = ""}
cat(htmltools::includeText("Examples/rmarkdown4.Rmd"))
```
In the next few sections, we will look at how formatting and code chunks work in R Markdown.
### Formatting text
To create plain text in a Markdown file, you simply have to write plain text. If you wish to add some formatting to your text, you can use the following:
* `_italics_` or `*italics*`: to create text in _italics_.
* `__bold__` or `**bold**`: to create **bold** text.
* `[linked text](http://www.modernstatisticswithr.com)`: to create [linked text](http://www.modernstatisticswithr.com).
* `` `code` ``: to include inline `code` in your document.
* `$a^2 + b^2 = c^2$`: to create inline equations like $a^2 + b^2 = c^2$ using LaTeX syntax.
* `$$a^2 + b^2 = c^2$$`: to create a centred equation on a new line, like $$a^2 + b^2 = c^2.$$
To add headers and subheaders, and to divide your document into section, start a new line with `#`'s as follows:
```
# Header text
## Sub-header text
### Sub-sub-header text
...and so on.
```
### Lists, tables, and images
To create a bullet list, you can use `*` as follows. Note that you need a blank line between your list and the previous paragraph to begin a list.
```{r echo = FALSE, comment = ""}
cat(htmltools::includeText("Examples/rmarkdown5.Rmd"))
```
yielding:
* Item 1
* Item 2
+ Sub-item 1
+ Sub-item 2
* Item 3
To create an ordered list, use:
```{r echo = FALSE, comment = ""}
cat(htmltools::includeText("Examples/rmarkdown6.Rmd"))
```
yielding:
1. First item
2. Second item
i) Sub-item 1
ii) Sub-item 2
3. Item 3
To create a table, use `|` and `---------` as follows:
```{r echo = FALSE, comment = ""}
cat(htmltools::includeText("Examples/rmarkdown7.Rmd"))
```
which yields the following output:
Column 1 | Column 2
--------- | ---------
Content | More content
Even more | And some here
Even more? | Yes!
To include an image, use the same syntax as when creating linked text with a link to the image path (either local or on the web), but with a `!` in front:
```{r echo = FALSE, comment = ""}
cat(htmltools::includeText("Examples/rmarkdown8.Rmd"))
```
which yields the following:
![](Rlogo.png)
![Put some text here if you want a caption](Rlogo.png)
### Code chunks
The simplest way to define a code chunk is to write:
```{r echo = FALSE, comment = ""}
cat(htmltools::includeText("Examples/rmarkdown9.Rmd"))
```
In RStudio, Ctrl+Alt+I is a keyboard shortcut for inserting this kind of code chunk.
We can add a name and a caption to the chunk, which lets us reference objects created by the chunk:
```{r echo = FALSE, comment = ""}
cat(htmltools::includeText("Examples/rmarkdown10.Rmd"))
```
This yields the following output:
---
```{r pressureplot, fig.cap = "Plot of the pressure data."}
plot(pressure)
```
As we can see in Figure \@ref(fig:pressureplot), the relationship between
temperature and pressure resembles a banana.
---
In addition, you can add settings to the chunk header to control its behaviour. For instance, you can include a code chunk without running it by adding `echo = FALSE`:
```{r echo = FALSE, comment = ""}
cat(htmltools::includeText("Examples/rmarkdown11.Rmd"))
```
You can add the following settings to your chunks:
* `echo = FALSE` to run the code without printing it,
* `eval = FALSE` to print the code without running it,
* `results = "hide"` to hide printed output,
* `fig.show = "hide"` to hide plots,
* `warning = FALSE` to suppress warning messages from being printed in your document,
* `message = FALSE` to suppress other messages from being printed in your document,
* `include = FALSE` to run a chunk without showing the code or results in the document,
* `error = TRUE` to continue running your R Markdown document even if there is an error in the chunk (by default, the documentation creation stops if there is an error).
Data frames can be printed either as in the Console or as a nicely formatted table. For example,
```{r echo = FALSE, comment = ""}
cat(htmltools::includeText("Examples/rmarkdown12.Rmd"))
```
yields:
```{r, echo = FALSE}
head(airquality)
```
whereas
```{r echo = FALSE, comment = ""}
cat(htmltools::includeText("Examples/rmarkdown13.Rmd"))
```
yields the table below.
```{r, echo = FALSE}
knitr::kable(
head(airquality),
caption = "Some data I found."
)
```
Further help and documentation for R Markdown can be found through the RStudio menus, by clicking _Help > Cheatsheets > R Markdown Cheat Sheet_ or _Help > Cheatsheets > R Markdown Reference Guide_.
## Customising `ggplot2` plots {#themes}
We'll be using `ggplot2` a lot in this chapter, so before we get started with exploratory analyses, we'll take some time to learn how we can customise the look of `ggplot2`-plots.
Consider the following facetted plot from Section \@ref(comparinggroups):
```{r eval=FALSE}
library(ggplot2)
ggplot(msleep, aes(brainwt, sleep_total)) +
geom_point() +
xlab("Brain weight (logarithmic scale)") +
ylab("Total sleep time") +
scale_x_log10() +
facet_wrap(~ vore)
```
It looks nice, sure, but there may be things that you'd like to change. Maybe you'd like the plot's background to be white instead of grey, or perhaps you'd like to use a different font. These, and many other things, can be modified using _themes_.
### Using themes
`ggplot2` comes with a number of basic themes. All are fairly similar, but differ in things like background colour, grids and borders. You can add them to your plot using `theme_themeName`\index{\texttt{theme\_...}}, where `themeName` is the name of the theme^[See `?theme_grey` for a list of available themes.]. Here are some examples:
```{r eval=FALSE}
p <- ggplot(msleep, aes(brainwt, sleep_total, colour = vore)) +
geom_point() +
xlab("Brain weight (logarithmic scale)") +
ylab("Total sleep time") +
scale_x_log10() +
facet_wrap(~ vore)
# Create plot with different themes:
p + theme_grey() # The default theme
p + theme_bw()
p + theme_linedraw()
p + theme_light()
p + theme_dark()
p + theme_minimal()
p + theme_classic()
```
There are several packages available that contain additional themes. Let's download a few:
```{r eval=FALSE}
install.packages("ggthemes")
library(ggthemes)
# Create plot with different themes from ggthemes:
p + theme_tufte() # Minimalist Tufte theme
p + theme_wsj() # Wall Street Journal
p + theme_solarized() + scale_colour_solarized() # Solarized colours
##############################
install.packages("hrbrthemes")
library(hrbrthemes)
# Create plot with different themes from hrbrthemes:
p + theme_ipsum() # Ipsum theme
p + theme_ft_rc() # Suitable for use with dark RStudio themes
p + theme_modern_rc() # Suitable for use with dark RStudio themes
```
### Colour palettes {#colourpalettes}
Unlike e.g. background colours, the _colour palette_, i.e. the list of colours used for plotting, is not part of the theme that you're using. Next, we'll have a look at how to change the colour palette used for your plot.
Let's start by creating a `ggplot` object:
```{r eval=FALSE}
p <- ggplot(msleep, aes(brainwt, sleep_total, colour = vore)) +
geom_point() +
xlab("Brain weight (logarithmic scale)") +
ylab("Total sleep time") +
scale_x_log10()
```
You can change the colour palette using `scale_colour_brewer`\index{\texttt{scale\_colour\_brewer}}\index{plot!choose colour palette}. Three types of colour palettes are available:
* **Sequential palettes**: that range from a colour to white. These are useful for representing ordinal (i.e. ordered) categorical variables and numerical variables.
* **Diverging palettes**: these range from one colour to another, with white in between. Diverging palettes are useful when there is a meaningful middle or 0 value (e.g. when your variables represent temperatures or profit/loss), which can be mapped to white.
* **Qualitative palettes**: which contain multiple distinct colours. These are useful for nominal (i.e. with no natural ordering) categorical variables.
See `?scale_colour_brewer` or http://www.colorbrewer2.org for a list of the available palettes. Here are some examples:
```{r eval=FALSE}
# Sequential palette:
p + scale_colour_brewer(palette = "OrRd")
# Diverging palette:
p + scale_colour_brewer(palette = "RdBu")
# Qualitative palette:
p + scale_colour_brewer(palette = "Set1")
```
In this case, because `vore` is a nominal categorical variable, a qualitative palette is arguably the best choice.
### Theme settings
The point of using a theme is that you get a combination of colours, fonts and other choices that are supposed to go well together, meaning that you don't have to spend too much time picking combinations. But if you like, you can override the default options and customise any and all parts of a theme.
The theme controls all visual aspects of the plot not related to the aesthetics. You can change the theme settings using the `theme` function. For instance, you can use `theme`\index{\texttt{theme}} to remove the legend or change its position:
```{r eval=FALSE}
# No legend:
p + theme(legend.position = "none")
# Legend below figure:
p + theme(legend.position = "bottom")
# Legend inside plot:
p + theme(legend.position = c(0.9, 0.7))
```
In the last example, the vector `c(0.9, 0.7)` gives the relative coordinates of the legend, with `c(0 0)` representing the bottom left corner of the plot and `c(1, 1)` the upper right corner. Try to change the coordinates to different values between `0` and `1` and see what happens.
`theme` has a lot of other settings, including for the colours of the background, the grid and the text in the plot. Here are a few examples that you can use as starting point for experimenting with the settings\index{\texttt{element\_line}}\index{\texttt{element\_rect}}\index{\texttt{element\_text}}\index{\texttt{element\_blank}}:
```{r eval=FALSE}
p + theme(panel.grid.major = element_line(colour = "black"),
panel.grid.minor = element_line(colour = "purple",
linetype = "dotted"),
panel.background = element_rect(colour = "red", size = 2),
plot.background = element_rect(fill = "yellow"),
axis.text = element_text(family = "mono", colour = "blue"),
axis.title = element_text(family = "serif", size = 14))
```
To find a complete list of settings, see `?theme`, `?element_line` (lines), `?element_rect` (borders and backgrounds), `?element_text` (text), and `element_blank` (for suppressing plotting of elements). As before, you can use `colors()` to get a list of built-in colours, or use colour hex codes.
$$\sim$$
```{exercise, label="ch4exc0"}
Use the documentation for `theme` and the `element_...` functions to change the plot object `p` created above as follows:
1. Change the background colour of the entire plot to `lightblue`.
2. Change the font of the legend to `serif`.
3. Remove the grid.
4. Change the colour of the axis ticks to `orange` and make them thicker.
```
`r if (knitr::is_html_output()) '[(Click here to go to the solution.)]' else '[]'`(#ch4solutions0)
## Exploring distributions
It is often useful to visualise the distribution of a numerical variable. Comparing the distributions of different groups can lead to important insights. Visualising distributions is also essential when checking assumptions used for various statistical tests (sometimes called _initial data analysis_). In this section we will illustrate how this can be done using the `diamonds` data from the `ggplot2` package, which you started to explore in Chapter \@ref(thebasics).
### Density plots and frequency polygons
We already know how to visualise the distribution of the data by dividing it into bins and plotting a histogram:
```{r eval=FALSE}
library(ggplot2)
ggplot(diamonds, aes(carat)) +
geom_histogram(colour = "black")
```
A similar plot is created using frequency polygons\index{frequency polygons}, which uses lines instead of bars to display the counts in the bins\index{\texttt{geom\_freqpoly}}:
```{r eval=FALSE}
ggplot(diamonds, aes(carat)) +
geom_freqpoly()
```
An advantage with frequency polygons is that they can be used to compare groups, e.g. diamonds with different cuts, without facetting:
```{r eval=FALSE}
ggplot(diamonds, aes(carat, colour = cut)) +
geom_freqpoly()
```
It is clear from this figure that there are more diamonds with ideal cuts than diamonds with fair cuts in the data. The polygons have roughly the same shape, except perhaps for the polygon for diamonds with fair cuts.
In some cases, we are more interested in the shape of the distribution than in the actual counts in the different bins. Density plots\index{density plot} are similar to frequency polygons but show an estimate of the density function of the underlying random variable. These estimates are smooth curves that are scaled so that the area below them is 1 (i.e. scaled to be proper density functions)\index{\texttt{geom\_density}}:
```{r eval=FALSE}
ggplot(diamonds, aes(carat, colour = cut)) +
geom_density()
```
From this figure, it becomes clear that low-carat diamonds tend to have better cuts, which wasn't obvious from the frequency polygons. However, the plot does not provide any information about _how_ common different cuts are. Use density plots if you're more interested in the shape of a variable's distribution, and frequency polygons if you're more interested in counts.
$$\sim$$
```{exercise, label="ch4exc9"}
Using the density plot created above and the documentation for `geom_density`, do the following:
1. Increase the smoothness of the density curves.
2. Fill the area under the density curves with the same colour as the curves themselves.
3. Make the colours that fill the areas under the curves transparent.
4. The figure still isn't that easy to interpret. Install and load the `ggridges`\index{\texttt{ggridges}} package, an extension of `ggplot2` that allows you to make so-called ridge plots (density plots that are separated along the y-axis, similar to facetting). Read the documentation for `geom_density_ridges`\index{\texttt{geom\_density\_ridges}} and use it to make a ridge plot of diamond prices for different cuts.
```
`r if (knitr::is_html_output()) '[(Click here to go to the solution.)]' else '[]'`(#ch4solutions9)
<br>
```{exercise, label="ch4exc10"}
Return to the histogram created by `ggplot(diamonds, aes(carat)) + geom_histogram()` above. As there are very few diamonds with carat greater than 3, cut the x-axis at 3. Then decrease the bin width to 0.01. Do any interesting patterns emerge?
```
`r if (knitr::is_html_output()) '[(Click here to go to the solution.)]' else '[]'`(#ch4solutions10)
### Asking questions
Exercise \@ref(exr:ch4exc10) causes us to ask why diamonds with carat values that are multiples of 0.25 are more common than others. Perhaps the price is involved? Unfortunately, a plot of `carat` versus `price` is not that informative:
```{r eval=FALSE}
ggplot(diamonds, aes(carat, price)) +
geom_point(size = 0.05)
```
Maybe we could compute the average price in each bin of the histogram? In that case, we need to extract the bin breaks from the histogram somehow. We could then create a new categorical variable using the breaks with `cut` (as we did in Exercise \@ref(exr:ch3exc3b)). It turns out that extracting the bins is much easier using base graphics\index{\texttt{hist}!extract breaks/midpoints} than `ggplot2`, so let's do that:
```{r eval=FALSE}
# Extract information from a histogram with bin width 0.01,
# which corresponds to 481 breaks:
carat_br <- hist(diamonds$carat, breaks = 481, right = FALSE,
plot = FALSE)
# Of interest to us are:
# carat_br$breaks, which contains the breaks for the bins
# carat_br$mid, which contains the midpoints of the bins
# (useful for plotting!)
# Create categories for each bin:
diamonds$carat_cat <- cut(diamonds$carat, 481, right = FALSE)
```
We now have a variable, `carat_cat`, that shows to which bin each observation belongs. Next, we'd like to compute the mean for each bin. This is a grouped summary - mean by category. After we've computed the bin means, we could then plot them against the bin midpoints. Let's try it:
```{r eval=FALSE}
means <- aggregate(price ~ carat_cat, data = diamonds, FUN = mean)
plot(carat_br$mid, means$price)
```
That didn't work as intended. We get an error message when attempting to plot the results:
```{r eval=FALSE}
Error in xy.coords(x, y, xlabel, ylabel, log) :
'x' and 'y' lengths differ
```
The error message implies that the number of bins and the number of mean values that have been computed differ. But we've just computed the mean for each bin, haven't we? So what's going on?
By default, `aggregate` ignores groups for which there are no values when computing grouped summaries. In this case, there are a lot of empty bins - there is for instance no observation in the `[4.99,5)` bin. In fact, only 272 out of the 481 bins are non-empty.
We can solve this in different ways. One way is to remove the empty bins. We can do this using the `match`\index{\texttt{match}} function, which returns the positions of _matching values_ in two vectors. If we use it with the bins from the grouped summary and the vector containing all bins, we can find the indices of the non-empty bins. This requires the use of the `levels` function, that you'll learn more about in Section \@ref(factors):
```{r eval=FALSE}
means <- aggregate(price ~ carat_cat, data = diamonds, FUN = mean)
id <- match(means$carat_cat, levels(diamonds$carat_cat))
```
Finally, we'll also add some vertical lines to our plot, to call attention to multiples of 0.25\index{\texttt{abline}}\index{\texttt{geom\_vline}}.
`r if (knitr::is_html_output()) '<table><div class="column-left">'`\btwocol
Using base graphics is faster here:
```{r eval=FALSE}
plot(carat_br$mid[id], means$price,
cex = 0.5)
# Add vertical lines at multiples
# of 0.25:
abline(v = c(0.5, 0.75, 1,
1.25, 1.5))
```
`r if (knitr::is_html_output()) '</div><div class="column-right">'`\columnbreak
But we can of course stick to `ggplot2` if we like:
```{r, eval=F}
library(ggplot2)
d2 <- data.frame(
bin = carat_br$mid[id],
mean = means$price)
ggplot(d2, aes(bin, mean)) +
geom_point() +
geom_vline(xintercept =
c(0.5, 0.75, 1,
1.25, 1.5))
# geom_vline add vertical lines at
# multiples of 0.25
```
`r if (knitr::is_html_output()) '</div></table>'`\etwocol
It appears that there are small jumps in the prices at some of the 0.25-marks. This explains why there are more diamonds just above these marks than just below.
The above example illustrates three crucial things regarding exploratory data analysis:
* Plots (in our case, the histogram) often lead to new questions.
* Often, we must transform, summarise or otherwise manipulate our data to answer a question. Sometimes this is straightforward and sometimes it means diving deep into R code.
* Sometimes the thing that we're trying to do doesn't work straight away. There is almost always a solution though (and oftentimes more than one!). The more you work with R, the more problem-solving tricks you will learn.
### Violin plots
Density curves can also be used as alternatives to boxplots. In Exercise \@ref(exr:ch2exc7), you created boxplots to visualise price differences between diamonds of different cuts:
```{r eval=FALSE}
ggplot(diamonds, aes(cut, price)) +
geom_boxplot()
```
Instead of using a boxplot, we can use a violin plot. Each group is represented by a "violin"\index{\texttt{geom\_violin}}\index{violin plot}, given by a rotated and duplicated density plot:
```{r eval=FALSE}
ggplot(diamonds, aes(cut, price)) +
geom_violin()
```
Compared to boxplots, violin plots capture the entire distribution of the data rather than just a few numerical summaries. If you like numerical summaries (and you should!) you can add the median and the quartiles (corresponding to the borders of the box in the boxplot) using the `draw_quantiles` argument:
```{r eval=FALSE}
ggplot(diamonds, aes(cut, price)) +
geom_violin(draw_quantiles = c(0.25, 0.5, 0.75))
```
$$\sim$$
```{exercise, label="ch4exc11"}
Using the first boxplot created above, i.e. `ggplot(diamonds, aes(cut, price)) + geom_violin()`, do the following:
1. Add some colour to the plot by giving different colours to each violin.
2. Because the categories are shown along the x-axis, we don't really need the legend. Remove it.
3. Both boxplots and violin plots are useful. Maybe we can have the best of both worlds? Add the corresponding boxplot inside each violin. Hint: the `width` and `alpha` arguments in `geom_boxplot` are useful for creating a nice-looking figure here.\index{violin plot!combine with boxplot}
4. Flip the coordinate system to create horizontal violins and boxes instead.
```
`r if (knitr::is_html_output()) '[(Click here to go to the solution.)]' else '[]'`(#ch4solutions11)
### Combine multiple plots into a single graphic {#patchwork}
When exploring data with many variables, you'll often want to make the same kind of plot (e.g. a violin plot) for several variables. It will frequently make sense to place these side-by-side in the same plot window. The `patchwork`\index{\texttt{patchwork}} package extends `ggplot2` by letting you do just that\index{plot!multiple plots in one graphic}. Let's install it:
```{r eval=FALSE}
install.packages("patchwork")
```
To use it, save each plot as a plot object and then add them together:
```{r eval=FALSE}
plot1 <- ggplot(diamonds, aes(cut, carat, fill = cut)) +
geom_violin() +
theme(legend.position = "none")
plot2 <- ggplot(diamonds, aes(cut, price, fill = cut)) +
geom_violin() +
theme(legend.position = "none")
library(patchwork)
plot1 + plot2
```
You can also arrange the plots on multiple lines, with different numbers of plots on each line. This is particularly useful if you are combining different types of plots in a single plot window. In this case, you separate plots that are same line by `|` and mark the beginning of a new line with `/`:
```{r eval=FALSE}
# Create two more plot objects:
plot3 <- ggplot(diamonds, aes(cut, depth, fill = cut)) +
geom_violin() +
theme(legend.position = "none")
plot4 <- ggplot(diamonds, aes(carat, fill = cut)) +
geom_density(alpha = 0.5) +
theme(legend.position = c(0.9, 0.6))
# One row with three plots and one row with a single plot:
(plot1 | plot2 | plot3) / plot4
# One column with three plots and one column with a single plot:
(plot1 / plot2 / plot3) | plot4
```
(You may need to enlarge your plot window for this to look good!)
## Outliers and missing data
### Detecting outliers
Both boxplots and scatterplots are helpful in detecting deviating observations - often called outliers. Outliers can be caused by measurement errors or errors in the data input but can also be interesting rare cases that can provide valuable insights about the process that generated the data. Either way, it is often of interest to detect outliers, for instance because that may influence the choice of what statistical tests to use.
Let's return to the scatterplot of diamond carats versus prices:
```{r eval=FALSE}
ggplot(diamonds, aes(carat, price)) +
geom_point()
```
There are some outliers which we may want to study further. For instance, there is a surprisingly cheap 5 carat diamond, and some cheap 3 carat diamonds^[Note that it is not just the prices nor just the carats of these diamonds that make them outliers, but the unusual combinations of prices and carats!]. But how can we identify those points?
One option is to use the `plotly`\index{\texttt{plotly}} package to make an interactive version of the plot, where we can hover interesting points to see more information about them. Start by installing it:
```{r eval=FALSE}
install.packages("plotly")
```
To use `plotly` with a `ggplot` graphic, we store the graphic in a variable and then use it as input to the `ggplotly`\index{\texttt{ggplotly}}\index{interactive plot} function. The resulting (interactive!) plot takes a little longer than usual to load. Try hovering the points:
```{r eval=FALSE}
myPlot <- ggplot(diamonds, aes(carat, price)) +
geom_point()
library(plotly)
ggplotly(myPlot)
```
By default, `plotly` only shows the carat and price of each diamond. But we can add more information to the box by adding a `text` aesthetic\index{\texttt{aes}!\texttt{text}}:
```{r eval=FALSE}
myPlot <- ggplot(diamonds, aes(carat, price, text = paste("Row:",
rownames(diamonds)))) +
geom_point()
ggplotly(myPlot)
```
We can now find the row numbers of the outliers visually, which is very useful when exploring data.
$$\sim$$
```{exercise, label="ch4exc12"}
The variables `x` and `y` in the `diamonds` data describe the length and width of the diamonds (in mm). Use an interactive scatterplot to identify outliers in these variables. Check prices, carat and other information and think about if any of the outliers can be due to data errors.
```
`r if (knitr::is_html_output()) '[(Click here to go to the solution.)]' else '[]'`(#ch4solutions12)
### Labelling outliers
Interactive plots are great when exploring a dataset but are not always possible to use in other contexts, e.g. for printed reports and some presentations. In these other cases, we can instead annotate the plot with notes about outliers. One way to do this is to use a geom called `geom_text`\index{\texttt{geom\_text}}.
For instance, we may want to add the row numbers of outliers to a plot. To do so, we use `geom_text` along with a condition that specifies for which points we should add annotations. Like in Section \@ref(conditionsintro), if we e.g. wish to add row numbers for diamonds with carats greater than four, our condition would be `carat > 4`. The `ifelse` function, which we'll look closer at in Section \@ref(conditions), is perfect to use here. The syntax\index{\texttt{ifelse}} will be `ifelse(condition, what text to write if the condition is satisfied, what text to write else)`. To add row names for observations that fulfil the condition but not for other observations, we use `ifelse(condition, rownames(diamonds), "")`. If instead, we wanted to print the price of the diamonds, we'd use `ifelse(condition, price, "")`.
Here are some different examples of conditions used to plot text:
```{r eval=FALSE}
## Using the row number (the 5 carat diamond is on row 27,416)
ggplot(diamonds, aes(carat, price)) +
geom_point() +
geom_text(aes(label = ifelse(rownames(diamonds) == 27416,
rownames(diamonds), "")),
hjust = 1.1)
## (hjust=1.1 shifts the text to the left of the point)
## Plot text next to all diamonds with carat>4
ggplot(diamonds, aes(carat, price)) +
geom_point() +
geom_text(aes(label = ifelse(carat > 4, rownames(diamonds), "")),
hjust = 1.1)
## Plot text next to 3 carat diamonds with a price below 7500
ggplot(diamonds, aes(carat, price)) +
geom_point() +
geom_text(aes(label = ifelse(carat == 3 & price < 7500,
rownames(diamonds), "")),
hjust = 1.1)
```
$$\sim$$
```{exercise, label="ch4exc13"}
Create a static (i.e. non-interactive) scatterplot of `x` versus `y` from the `diamonds` data. Label the diamonds with suspiciously high $y$-values.
```
`r if (knitr::is_html_output()) '[(Click here to go to the solution.)]' else '[]'`(#ch4solutions13)
### Missing data
Like many datasets, the mammal sleep data `msleep` contains a lot of missing values, represented by `NA` (Not Available) in R. This becomes evident when we have a look at the data:
```{r eval=FALSE}
library(ggplot2)
View(msleep)
```
We can check if a particular observation is missing using the `is.na` function:
```{r eval=FALSE}
is.na(msleep$sleep_rem[4])
is.na(msleep$sleep_rem)
```
We can count the number of missing values for each variable using\index{\texttt{colSums}}:
```{r eval=FALSE}
colSums(is.na(msleep))
```
Here, `colSums` computes the sum of `is.na(msleep)` for each column of `msleep` (remember that in summation, `TRUE` counts as `1` and `FALSE` as `0`), yielding the number of missing values for each variable. In total, there are 136 missing values in the dataset:
```{r eval=FALSE}
sum(is.na(msleep))
```
You'll notice that `ggplot2` prints a warning in the Console when you create a plot with missing data:
```{r eval=FALSE}
ggplot(msleep, aes(brainwt, sleep_total)) +
geom_point() +
scale_x_log10()
```
Sometimes data are missing simply because the information is not yet available (for instance, the brain weight of the mountain beaver could be missing because no one has ever weighed the brain of a mountain beaver). In other cases, data can be missing because something about them is different (for instance, values for a male patient in a medical trial can be missing because the patient died, or because some values only were collected for female patients). Therefore, it is of interest to see if there are any differences in non-missing variables between subjects that have missing data and subjects that don't.
In `msleep`, all animals have recorded values for `sleep_total` and `bodywt`. To check if the animals that have missing `brainwt` values differ from the others, we can plot them in a different colour in a scatterplot:
```{r eval=FALSE}
ggplot(msleep, aes(bodywt, sleep_total, colour = is.na(brainwt))) +
geom_point() +
scale_x_log10()
```
(If `is.na(brainwt)` is `TRUE` then the brain weight is missing in the dataset.) In this case, there are no apparent differences between the animals with missing data and those without.
$$\sim$$
```{exercise, label="ch4exc14"}
Create a version of the diamonds dataset where the `x` value is missing for all diamonds with $x>9$. Make a scatterplot of carat versus price, with points where the `x` value is missing plotted in a different colour. How would you interpret this plot?
```
`r if (knitr::is_html_output()) '[(Click here to go to the solution.)]' else '[]'`(#ch4solutions14)
### Exploring data
The `nycflights13`\index{\texttt{nycflights13}}\index{data!\texttt{flights}} package contains data about flights to and from three airports in New York, USA, in 2013. As a summary exercise, we will study a subset of these, namely all flights departing from New York on 1 January that year:
```{r eval=FALSE}
install.packages("nycflights13")
library(nycflights13)
flights2 <- flights[flights$month == 1 & flights$day == 1,]
```
$$\sim$$
```{exercise, label="ch4exc15"}
Explore the `flights2` dataset, focusing on delays and the amount of time spent in the air. Are there any differences between the different carriers? Are there missing data? Are there any outliers?
```
`r if (knitr::is_html_output()) '[(Click here to go to the solution.)]' else '[]'`(#ch4solutions15)
## Trends in scatterplots
Let's return to a familiar example - the relationship between animal brain size and sleep times:
```{r eval=FALSE}
ggplot(msleep, aes(brainwt, sleep_total)) +
geom_point() +
xlab("Brain weight (logarithmic scale)") +
ylab("Total sleep time") +
scale_x_log10()
```
There appears to be a decreasing trend in the plot. To aid the eye, we can add a smoothed line by adding a new geom, `geom_smooth`\index{\texttt{geom\_smooth}}\index{trend curve}, to the figure:
```{r eval=FALSE}
ggplot(msleep, aes(brainwt, sleep_total)) +
geom_point() +
geom_smooth() +
xlab("Brain weight (logarithmic scale)") +
ylab("Total sleep time") +
scale_x_log10()
```
This technique is useful for bivariate data as well as for time series, which we'll delve into next.
By default, `geom_smooth` adds a line fitted using either LOESS^[LOESS, LOcally Estimated Scatterplot Smoothing, is a non-parametric regression method that fits a polynomial to local areas of the data.] or GAM^[GAM, Generalised Additive Model, is a generalised linear model where the response variable is a linear function of smooth functions of the predictors.], as well as the corresponding 95 \% confidence interval describing the uncertainty in the estimate. There are several useful arguments that can be used with `geom_smooth`. You will explore some of these in the exercise below.
$$\sim$$
```{exercise, label="ch4exc1"}
Check the documentation for `geom_smooth`. Starting with the plot of brain weight vs. sleep time created above, do the following:
1. Decrease the degree of smoothing for the LOESS line that was fitted to the data. What is better in this case, more or less smoothing?
2. Fit a straight line to the data instead of a non-linear LOESS line.
3. Remove the confidence interval from the plot.
4. Change the colour of the fitted line to red.
```
`r if (knitr::is_html_output()) '[(Click here to go to the solution.)]' else '[]'`(#ch4solutions1)
## Exploring time series {#tsplots}
Before we have a look at time series, you should install four useful packages: `forecast`, `nlme`, `fma` and `fpp2`\index{\texttt{forecast}}\index{\texttt{nlme}}\index{\texttt{fma}}\index{\texttt{fpp2}}. The first contains useful functions for plotting time series data, and the latter three contain datasets that we'll use.
```{r eval=FALSE}
install.packages(c("nlme", "forecast", "fma", "fpp2"),
dependencies = TRUE)
```
The `a10`\index{data!\texttt{a10}} dataset contains information about the monthly anti-diabetic drug sales in Australia during the period July 1991 to June 2008. By checking its structure, we see that it is saved as a time series object^[Time series objects are a special class of vectors, with data sampled at equispaced points in time. Each observation can have a year/date/time associated with it.]:
```{r eval=FALSE}
library(fpp2)
str(a10)
```
`ggplot2` requires that data is saved as a data frame in order for it to be plotted. In order to plot the time series, we could first convert it to a data frame and then plot each point using `geom_points`:
```{r eval=FALSE}
a10_df <- data.frame(time = time(a10), sales = a10)
ggplot(a10_df, aes(time, sales)) +
geom_point()
```
It is however usually preferable to plot time series using lines instead of points. This is done using a different geom: `geom_line`\index{\texttt{geom\_line}}:
```{r eval=FALSE}
ggplot(a10_df, aes(time, sales)) +
geom_line()
```
Having to convert the time series object to a data frame is a little awkward. Luckily, there is a way around this. `ggplot2` offers a function called `autoplot`\index{\texttt{autoplot}}, that automatically draws an appropriate plot for certain types of data. `forecast` extends this function to time series objects\index{time series!plot}:
```{r eval=FALSE}
library(forecast)
autoplot(a10)
```
We can still add other geoms, axis labels and other things just as before. `autoplot` has simply replaced the `ggplot(data, aes()) + geom` part that would be the first two rows of the `ggplot2` figure, and has implicitly converted the data to a data frame.
$$\sim$$
```{exercise, label="ch4exc2"}
Using the `autoplot(a10)` figure, do the following:
1. Add a smoothed line describing the trend in the data. Make sure that it is smooth enough _not_ to capture the seasonal variation in the data.
2. Change the label of the x-axis to "Year" and the label of the y-axis to "Sales ($ million)".
3. Check the documentation for the `ggtitle` function\index{\texttt{ggtitle}}. What does it do? Use it with the figure.
4. Change the colour of the time series line to red.
```
`r if (knitr::is_html_output()) '[(Click here to go to the solution.)]' else '[]'`(#ch4solutions2)
### Annotations and reference lines
We sometimes wish to add text or symbols to plots, for instance to highlight interesting observations. Consider the following time series plot of daily morning gold prices, based on the `gold`\index{data!\texttt{gold}} data from the `forecast` package:
```{r eval=FALSE}
library(forecast)
autoplot(gold)
```
There is a sharp spike a few weeks before day 800, which is due to an incorrect value in the data series. We'd like to add a note about that to the plot. First, we wish to find out on which day the spike appears. This can be done by checking the data manually or using some code:
```{r eval=FALSE}
spike_date <- which.max(gold)
```
To add a circle around that point, we add a call to `annotate`\index{\texttt{annotate}}\index{annotate plot} to the plot:
```{r eval=FALSE}
autoplot(gold) +
annotate(geom = "point", x = spike_date, y = gold[spike_date],
size = 5, shape = 21,
colour = "red",
fill = "transparent")
```
`annotate` can be used to annotate the plot with both geometrical objects and text (and can therefore be used as an alternative to `geom_text`).
$$\sim$$
```{exercise, label="ch4exc3"}
Using the figure created above and the documentation for `annotate`, do the following:
1. Add the text "Incorrect value" next to the circle.
2. Create a second plot where the incorrect value has been removed.
3. Read the documentation for the geom `geom_hline`\index{\texttt{geom\_hline}}. Use it to add a red reference line to the plot, at $y=400$.
```
`r if (knitr::is_html_output()) '[(Click here to go to the solution.)]' else '[]'`(#ch4solutions3)
### Longitudinal data
Multiple time series with identical time points, known as longitudinal data or panel data, are common in many fields. One example of this is given by the `elecdaily` time series from the `fpp2` package\index{data!\texttt{elecdaily}}\index{longitudinal data}\index{panel data}, which contains information about electricity demand in Victoria, Australia during 2014. As with a single time series, we can plot these data using `autoplot`:
```{r eval=FALSE}
library(fpp2)
autoplot(elecdaily)
```
In this case, it is probably a good idea to facet the data, i.e. to plot each series in a different figure:
```{r eval=FALSE}
autoplot(elecdaily, facets = TRUE)
```
$$\sim$$
```{exercise, label="ch4exc4"}
Make the following changes to the `autoplot(elecdaily, facets = TRUE)`:
1. Remove the `WorkDay` variable from the plot (it describes whether or not a given date is a workday, and while it is useful for modelling purposes, we do not wish to include it in our figure).
2. Add smoothed trend lines to the time series plots.
```
`r if (knitr::is_html_output()) '[(Click here to go to the solution.)]' else '[]'`(#ch4solutions4)
### Path plots
Another option for plotting multiple time series is path plots\index{path plot}. A path plot is a scatterplot where the points are connected with lines in the order they appear in the data (which, for time series data, should correspond to time). The lines and points can be coloured to represent time.
To make a path plot of Temperature versus Demand for the `elecdaily` data, we first convert the time series object to a data frame and create a scatterplot:
```{r eval=FALSE}
library(fpp2)
ggplot(as.data.frame(elecdaily), aes(Temperature, Demand)) +
geom_point()
```
Next, we connect the points by lines using the `geom_path`\index{\texttt{geom\_path}} geom:
```{r eval=FALSE}
ggplot(as.data.frame(elecdaily), aes(Temperature, Demand)) +
geom_point() +
geom_path()
```
The resulting figure is quite messy. Using colour to indicate the passing of time helps a little. For this, we need to add the day of the year to the data frame. To get the values right, we use `nrow`\index{\texttt{nrow}}\index{data frame!number of rows}, which gives us the number of rows in the data frame.
```{r eval=FALSE}
elecdaily2 <- as.data.frame(elecdaily)
elecdaily2$day <- 1:nrow(elecdaily2)
ggplot(elecdaily2, aes(Temperature, Demand, colour = day)) +
geom_point() +
geom_path()
```
It becomes clear from the plot that temperatures were the highest at the beginning of the year and lower in the winter months (July-August).
$$\sim$$
```{exercise, label="ch4exc5"}
Make the following changes to the plot you created above:
1. Decrease the size of the points.
2. Add text annotations showing the dates of the highest and lowest temperatures, next to the corresponding points in the figure.
```
`r if (knitr::is_html_output()) '[(Click here to go to the solution.)]' else '[]'`(#ch4solutions5)
### Spaghetti plots
In cases where we've observed multiple subjects over time, we often wish to visualise their individual time series together using so-called spaghetti plots. With `ggplot2` this is done using the `geom_line` geom.\index{spaghetti plot} To illustrate this, we use the `Oxboys`\index{data!\texttt{Oxboys}} data from the `nlme` package, showing the heights of 26 boys over time.
```{r eval=FALSE}
library(nlme)
ggplot(Oxboys, aes(age, height, group = Subject)) +
geom_point() +
geom_line()
```
The first two `aes` arguments specify the x and y-axes, and the third specifies that there should be one line per subject (i.e. per boy) rather than a single line interpolating all points. The latter would be a rather useless figure that looks like this:
```{r eval=FALSE}
ggplot(Oxboys, aes(age, height)) +
geom_point() +
geom_line() +
ggtitle("A terrible plot")
```
Returning to the original plot, if we wish to be able to identify which time series corresponds to which boy, we can add a colour aesthetic:
```{r eval=FALSE}
ggplot(Oxboys, aes(age, height, group = Subject, colour = Subject)) +
geom_point() +
geom_line()
```
Note that the boys are ordered by height, rather than subject number, in the legend.
Now, imagine that we wish to add a trend line describing the general growth trend for all boys. The growth appears approximately linear, so it seems sensible to use `geom_smooth(method = "lm")` to add the trend: