# Data Visualization {#sec-viz}
```{r}
#| label: setup-viz
#| include: false
#| purl: false
# for number learning checks
chap <- 2
lc <- 0
# `r paste0(chap, ".", (lc <- lc + 1))`
knitr::opts_chunk$set(
tidy = FALSE,
out.width = '\\textwidth',
fig.height = 4,
fig.align='center',
warning = FALSE
)
options(scipen = 99, digits = 3)
# In knitr::kable printing replace all NA's with blanks
options(knitr.kable.NA = '')
# Set random number generator seed value for replicable pseudorandomness. Why 76?
# https://www.youtube.com/watch?v=xjJ7FheCkCU
set.seed(76)
```
We begin the development of your data science toolbox with data visualization. By visualizing our data, we gain valuable insights that we couldn't initially see from just looking at the raw data in spreadsheet form. We will use the `ggplot2` package as it provides an easy way to customize your plots. `ggplot2` is rooted in the data visualization theory known as *The Grammar of Graphics* [@wilkinson2005].
At the most basic level, graphics/plots/charts (we use these terms interchangeably in this book) provide a nice way for us to get a sense for how quantitative variables compare in terms of their center (where the values tend to be located) and their spread (how they vary around the center). Graphics should be designed to emphasize the findings and insight you want your audience to understand. This does however require a balancing act. On the one hand, you want to highlight as many meaningful relationships and interesting findings as possible; on the other you don't want to include so many as to overwhelm your audience.
As we will see, plots/graphics also help us to identify patterns and outliers in our data. We will see that a common extension of these ideas is to compare the *distribution* of one quantitative variable (i.e., what the spread of a variable looks like or how the variable is *distributed* in terms of its values) as we go across the levels of a different categorical variable.
## Packages Needed {.unnumbered}
Let's load all the packages needed for this chapter (this assumes you've already installed them). Read @sec-packages for information on how to install and load R packages.
```{r}
#| label: pkgs-ch02
#| message: false
library(nycflights13)
library(ggplot2)
library(dplyr)
```
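If you are unsure whether these packages are already installed, a quick check along the following lines may help. This is only a sketch: the names `needed_pkgs`, `is_installed`, and `missing_pkgs` are our own, and the code only reports what is missing rather than installing anything.

```{r}
# Packages loaded in this chapter
needed_pkgs <- c("nycflights13", "ggplot2", "dplyr")

# requireNamespace() returns TRUE if a package can be loaded, FALSE otherwise
is_installed <- vapply(needed_pkgs, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- needed_pkgs[!is_installed]

if (length(missing_pkgs) > 0) {
  message("Not yet installed: ", paste(missing_pkgs, collapse = ", "))
} else {
  message("All packages for this chapter are installed.")
}
```

Any package reported as missing can then be installed following the instructions in @sec-packages.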
```{r}
#| label: pkgs-ch01-internal
#| message: false
#| warning: false
#| echo: false
# Packages needed internally, but not in text.
library(gapminder)
library(knitr)
library(kableExtra)
library(readr)
```
## The Grammar of Graphics {#sec-grammarofgraphics}
We begin with a discussion of a theoretical framework for data visualization known as "The Grammar of Graphics," which serves as the foundation for the `ggplot2` package. Think of how we construct sentences in English by combining different elements, like nouns, verbs, particles, subjects, objects, etc. However, we can't just combine these elements in any arbitrary order; we must do so following a set of rules known as a linguistic grammar. Similarly to a linguistic grammar, "The Grammar of Graphics" defines a set of rules for constructing *statistical graphics* by combining different types of *layers*. This grammar was created by Leland Wilkinson [@wilkinson2005] and has been implemented in a variety of data visualization software including R.
### Components of the Grammar
In short, the grammar tells us that:
> **A statistical graphic is a `mapping` of `data` variables to `aes`thetic attributes of `geom`etric objects.**
Specifically, we can break a graphic into three essential components:
1. `data`: the data set composed of variables that we map.
2. `geom`: the geometric object in question. This refers to the type of object we can observe in a plot. For example: points, lines, and bars.
3. `aes`: aesthetic attributes of the geometric object. For example, x-position, y-position, color, shape, and size. Each assigned aesthetic attribute can be mapped to a variable in our data set.
You might be wondering why we wrote the terms `data`, `geom`, and `aes` in a computer code type font. We'll see very shortly that we'll specify the elements of the grammar in R using these terms. However, let's first break down the grammar with an example.
### Gapminder data {#sec-gapminder}
```{r}
#| echo: false
gapminder_2007 <- gapminder %>%
filter(year == 2007) %>%
select(-year) %>%
rename(
Country = country,
Continent = continent,
`Life Expectancy` = lifeExp,
`Population` = pop,
`GDP per Capita` = gdpPercap
)
```
In February 2006, a statistician named Hans Rosling gave a TED talk titled ["The best stats you've ever seen"](https://www.ted.com/talks/hans_rosling_shows_the_best_stats_you_ve_ever_seen) where he presented global economic, health, and development data from the website [gapminder.org](http://www.gapminder.org/tools/#_locale_id=en;&chart-type=bubbles). For example, for the `r nrow(gapminder_2007)` countries included from 2007, let's consider only the first 6 countries when listed alphabetically in @tbl-gapminder-2007.
```{r}
#| label: tbl-gapminder-2007
#| tbl-cap: "Gapminder 2007 Data: First 6 of 142 countries"
#| echo: false
gapminder_2007 %>%
head() %>%
kable(
format = "markdown",
digits = 2,
caption = "Gapminder 2007 Data: First 6 of 142 countries",
booktabs = TRUE
) %>%
kable_styling(
font_size = ifelse(knitr::is_latex_output(), 10, 16),
latex_options = c("HOLD_position")
)
```
Each row in this table corresponds to a country in 2007. For each row, we have 5 columns:
1. **Country**: Name of country.
2. **Continent**: Which of the five continents the country is part of. (Note that "Americas" includes countries in both North and South America and that Antarctica is excluded.)
3. **Life Expectancy**: Life expectancy in years.
4. **Population**: Number of people living in the country.
5. **GDP per Capita**: Gross domestic product per person (in US dollars).
Now consider @fig-gapminder, which plots this data for all `r nrow(gapminder_2007)` countries in the data.
```{r}
#| label: fig-gapminder
#| echo: false
#| fig-cap: Life Expectancy over GDP per Capita in 2007
ggplot(
data = gapminder_2007,
mapping =
aes(
x = `GDP per Capita`,
y = `Life Expectancy`,
size = Population, col = Continent
)
) +
geom_point() +
labs(x = "GDP per capita", y = "Life expectancy")
```
Let's view this plot through the grammar of graphics:
1. The `data` variable **GDP per Capita** gets mapped to the `x`-position `aes`thetic of the points.
2. The `data` variable **Life Expectancy** gets mapped to the `y`-position `aes`thetic of the points.
3. The `data` variable **Population** gets mapped to the `size` `aes`thetic of the points.
4. The `data` variable **Continent** gets mapped to the `color` `aes`thetic of the points.
We'll see shortly that `data` corresponds to the particular data frame where our data is saved and a "data variable" corresponds to a particular column in the data frame. Furthermore, the `geom`etric objects considered in this plot are points. That being said, while in this example we are considering points, graphics are not limited to just points. Other plots involve lines while others involve bars.
Let's summarize the three essential components of the Grammar in @tbl-summary-gapminder.
```{r}
#| label: tbl-summary-gapminder
#| tbl-cap: "Summary of Grammar of Graphics for this plot"
#| echo: false
tibble(
`data variable` = c("GDP per Capita", "Life Expectancy", "Population", "Continent"),
aes = c("x", "y", "size", "color"),
geom = c("point", "point", "point", "point")
) %>%
kable(
format = "markdown",
caption = "Summary of Grammar of Graphics for this plot",
booktabs = TRUE
) %>%
kable_styling(
font_size = ifelse(knitr::is_latex_output(), 10, 16),
latex_options = c("HOLD_position")
)
```
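The four mappings summarized above can be sketched in code as follows. This is a minimal sketch rather than the exact code behind @fig-gapminder: it assumes the `gapminder` package is installed, uses that package's original column names (`gdpPercap`, `lifeExp`, `pop`, `continent`), and the object names `gapminder_2007_raw` and `gapminder_plot` are our own.

```{r}
library(gapminder)
library(dplyr)
library(ggplot2)

# Keep only the 2007 observations, one row per country
gapminder_2007_raw <- gapminder %>%
  filter(year == 2007)

# Each data variable is mapped to one aesthetic attribute of the points
gapminder_plot <- ggplot(
  data = gapminder_2007_raw,
  mapping = aes(
    x = gdpPercap,     # GDP per capita  -> x-position
    y = lifeExp,       # life expectancy -> y-position
    size = pop,        # population      -> size
    color = continent  # continent       -> color
  )
) +
  geom_point()

gapminder_plot
```

Changing which variable is assigned to which aesthetic inside `aes()` changes the plot without touching the data or the `geom`, which is the separation of concerns the grammar is meant to provide.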
### Other components
There are other components of the Grammar of Graphics we can control as well. As you start to delve deeper into the Grammar of Graphics, you'll start to encounter these topics more frequently. In this book however, we'll keep things simple and only work with the two additional components listed below:
- `facet`ing breaks up a plot into small multiples corresponding to the levels of another variable (@sec-facets)
- `position` adjustments for barplots (@sec-geombar)
Other more complex components like `scales` and `coord`inate systems are left for a more advanced text such as [R for Data Science](http://r4ds.had.co.nz/data-visualisation.html#aesthetic-mappings) [@rds2016]. Generally speaking, the Grammar of Graphics allows for a high degree of customization of plots and also a consistent framework for easily updating and modifying them.
### ggplot2 package
In this book, we will be using the `ggplot2` package for data visualization, which is an implementation of the Grammar of Graphics for R [@R-ggplot2]. As we noted earlier, a lot of the previous section was written in a computer code type font. This is because the various components of the Grammar of Graphics are specified in the `ggplot()` function included in the `ggplot2` package, which expects at a minimum as arguments (i.e. inputs):
- The data frame where the variables exist: the `data` argument.
- The mapping of the variables to aesthetic attributes: the `mapping` argument which specifies the `aes`thetic attributes involved.
After we've specified these components, we then add *layers* to the plot using the `+` sign. The most essential layer to add to a plot is the layer that specifies which type of `geom`etric object we want the plot to involve: points, lines, bars, and others. Other layers we can add to a plot include layers specifying the plot title, axes labels, visual themes for the plots, and facets (which we'll see in @sec-facets).
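As a rough template, the pieces fit together as in the sketch below. The data frame `toy_data` and its columns `x_var` and `y_var` are hypothetical and exist only to make the example runnable; the labels and theme are likewise arbitrary choices.

```{r}
library(ggplot2)

# Hypothetical toy data, used only to illustrate the template
toy_data <- data.frame(
  x_var = c(1, 2, 3, 4),
  y_var = c(2, 4, 3, 5)
)

toy_plot <- ggplot(data = toy_data, mapping = aes(x = x_var, y = y_var)) +  # data and aesthetic mapping
  geom_point() +                                                            # geometric object layer
  labs(title = "A toy scatterplot", x = "x_var", y = "y_var") +             # title and axis labels
  theme_minimal()                                                           # visual theme

toy_plot
```

Only the first two lines of the `ggplot()` call are needed to get a plot; the title, labels, and theme are optional additions.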
Let's now put the theory of the Grammar of Graphics into practice.
## Five Named Graphs - The 5NG {#sec-five-ng}
In order to keep things simple, we will only focus on five types of graphics in this book, each with a commonly given name. We term these "five named graphs" the **5NG**:
1. scatterplots
2. linegraphs
3. boxplots
4. histograms
5. barplots
We will discuss some variations of these plots, but with this basic repertoire of graphics in your toolbox you can visualize a wide array of different variable types. Note that certain plots are only appropriate for categorical variables, while others are only appropriate for quantitative variables. You'll want to quiz yourself often as we go along on which plot makes sense given a particular problem or data set.
## 5NG#1: Scatterplots {#sec-scatterplots}
The simplest of the 5NG are *scatterplots*, also called bivariate plots. They allow you to visualize the relationship between two numerical variables. While you may already be familiar with scatterplots, let's view them through the lens of the Grammar of Graphics. Specifically, we will visualize the relationship between the following two numerical variables in the `flights` data frame included in the `nycflights13` package:
1. `dep_delay`: departure delay on the horizontal "x" axis and
2. `arr_delay`: arrival delay on the vertical "y" axis
for Alaska Airlines flights leaving NYC in 2013. This requires paring down the data from all 336,776 flights that left NYC in 2013, to only the 714 *Alaska Airlines* flights that left NYC in 2013.
What this means computationally is: we'll take the `flights` data frame, extract only the 714 rows corresponding to Alaska Airlines flights, and save this in a new data frame called `alaska_flights`. Run the code below to do this:
```{r}
alaska_flights <- flights %>%
filter(carrier == "AS")
```
For now we suggest you ignore how this code works; we'll explain this in detail in @sec-wrangling when we cover data wrangling. However, convince yourself that this code does what it is supposed to by running `View(alaska_flights)`: it creates a new data frame `alaska_flights` consisting of only the 714 Alaska Airlines flights.
We'll see in @sec-wrangling that this code uses the `dplyr` package for data wrangling to achieve our goal: it takes the `flights` data frame and `filter`s it to only return the rows where `carrier` is equal to `"AS"`, Alaska Airlines' carrier code. Other examples of carrier codes include "AA" for American Airlines and "UA" for United Airlines. Recall from @sec-code that testing for equality is specified with `==` and not `=`. Fasten your seat belts and sit tight for now, however; we'll introduce these ideas more fully in @sec-wrangling.
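Besides `View()`, a quick way to convince yourself that the filtering worked is to compare row counts and inspect which carrier codes remain. The sketch below re-creates `alaska_flights` so that it can be run on its own.

```{r}
library(nycflights13)
library(dplyr)

alaska_flights <- flights %>%
  filter(carrier == "AS")

# Number of rows before and after filtering
nrow(flights)
nrow(alaska_flights)

# Only the Alaska Airlines carrier code should remain
unique(alaska_flights$carrier)
```

The two row counts should agree with the figures quoted above for all flights and for Alaska Airlines flights.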
::: {.callout-tip icon="false" collapse="true"}
## :dart: Learning Check `r paste0(chap, ".", (lc <- lc + 1))`
Take a look at both the `flights` and `alaska_flights` data frames by running `View(flights)` and `View(alaska_flights)`. In what respect do these data frames differ?
:::
### Scatterplots via geom_point {#sec-geompoint}
Let's now go over the code that will create the desired scatterplot, keeping in mind our discussion on the Grammar of Graphics in @sec-grammarofgraphics. We'll be using the `ggplot()` function included in the `ggplot2` package.
```{r}
#| eval: false
ggplot(data = alaska_flights, mapping = aes(x = dep_delay, y = arr_delay)) +
geom_point()
```
Let's break this down piece-by-piece:
- Within the `ggplot()` function, we specify two of the components of the Grammar of Graphics as arguments (i.e. inputs):
    1. The `data` frame to be `alaska_flights` by setting `data = alaska_flights`.
    2. The `aes`thetic `mapping` by setting `aes(x = dep_delay, y = arr_delay)`. Specifically:
        - the variable `dep_delay` maps to the `x` position aesthetic
        - the variable `arr_delay` maps to the `y` position aesthetic
- We add a layer to the `ggplot()` function call using the `+` sign. The layer in question specifies the third component of the grammar: the `geom`etric object. In this case the geometric objects are points, set by specifying `geom_point()`.
After running the above code, you'll notice two outputs: a warning message and the graphic shown in @fig-noalpha. Let's first unpack the warning message:
```{r}
#| label: fig-noalpha
#| fig-cap: "Arrival Delays vs Departure Delays for Alaska Airlines flights from NYC in 2013"
#| warning: true
#| echo: false
ggplot(data = alaska_flights, mapping = aes(x = dep_delay, y = arr_delay)) +
geom_point()
```
After running the above code, R returns a warning message alerting us to the fact that 5 rows were ignored because they contain missing values. For these 5 rows, the value of `dep_delay`, `arr_delay`, or both was missing (recorded in R as `NA`), and thus these rows were left out of our plot. Turning our attention to the resulting scatterplot in @fig-noalpha, we see that a positive relationship exists between `dep_delay` and `arr_delay`: as departure delays increase, arrival delays tend to also increase. We also note the large mass of points clustered near (0, 0).
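To see where the warning comes from, we can count the rows with a missing delay ourselves. The sketch below uses `is.na()` together with `filter()`, which is covered properly in @sec-wrangling; the name `rows_with_missing` is our own. It also shows one possible way to silence the warning, by setting `na.rm = TRUE` in `geom_point()` once we know why the rows are being dropped.

```{r}
library(nycflights13)
library(dplyr)
library(ggplot2)

alaska_flights <- flights %>%
  filter(carrier == "AS")

# Rows where the departure delay, the arrival delay, or both are missing
rows_with_missing <- alaska_flights %>%
  filter(is.na(dep_delay) | is.na(arr_delay))
nrow(rows_with_missing)

# Same scatterplot, but missing values are dropped silently
delay_plot <- ggplot(data = alaska_flights, mapping = aes(x = dep_delay, y = arr_delay)) +
  geom_point(na.rm = TRUE)
delay_plot
```

The count returned by `nrow()` should match the number of rows reported in the warning message.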
Before we continue, let's consider a few more notes on the layers in the above code that generated the scatterplot:
- Note that the `+` sign comes at the end of lines, and not at the beginning. You'll get an error in R if you put it at the beginning.
- When adding layers to a plot, you are encouraged to start a new line after the `+` so that the code for each layer is on a new line. As we add more and more layers to plots, you'll see this will greatly improve the legibility of your code.
- To stress the importance of adding layers, in particular the layer specifying the `geom`etric object, consider @fig-nolayers, where no layers are added. A not very useful plot!
```{r}
#| label: fig-nolayers
#| fig-cap: "Plot with no layers"
ggplot(data = alaska_flights, mapping = aes(x = dep_delay, y = arr_delay))
```
::: {.callout-tip icon="false" collapse="true"}
## :dart: Learning Check `r paste0(chap, ".", (lc <- lc + 1))`
What are some practical reasons why `dep_delay` and `arr_delay` have a positive relationship?
:::
::: {.callout-tip icon="false" collapse="true"}
## :dart: Learning Check `r paste0(chap, ".", (lc <- lc + 1))`
What variables (not necessarily in the `flights` data frame) would you expect to have a negative correlation (i.e. a negative relationship) with `dep_delay`? Why? Remember that we are focusing on numerical variables here.
:::
::: {.callout-tip icon="false" collapse="true"}
## :dart: Learning Check `r paste0(chap, ".", (lc <- lc + 1))`
Why do you believe there is a cluster of points near (0, 0)? What does (0, 0) correspond to in terms of the Alaskan flights?
:::
::: {.callout-tip icon="false" collapse="true"}
## :dart: Learning Check `r paste0(chap, ".", (lc <- lc + 1))`
What are some other features of the plot that stand out to you?
:::
::: {.callout-tip icon="false" collapse="true"}
## :dart: Learning Check `r paste0(chap, ".", (lc <- lc + 1))`
Create a new scatterplot using different variables in the `alaska_flights` data frame by modifying the example above.
:::
### Over-plotting {#sec-overplotting}
The large mass of points near (0, 0) in @fig-noalpha can cause some confusion as it is hard to tell the true number of points that are plotted. This is the result of a phenomenon called *overplotting*. As one may guess, this corresponds to values being plotted on top of each other *over* and *over* again. It is often difficult to know just how many values are plotted in this way when looking at a basic scatterplot as we have here. There are two methods to address the issue of overplotting:
1. By adjusting the transparency of the points.
2. By adding a little random "jitter", or random "nudges", to each of the points.
**Method 1: Changing the transparency**
The first way of addressing overplotting is by changing the transparency of the points by using the `alpha` argument in `geom_point()`. By default, this value is set to `1`. We can change this to any value between `0` and `1`, where `0` sets the points to be 100% transparent and `1` sets the points to be 100% opaque. Note how the following code is identical to the code in @sec-scatterplots that created the scatterplot with overplotting, but with `alpha = 0.2` added to the `geom_point()`:
```{r}
#| label: fig-alpha
#| fig-cap: "Delay scatterplot with alpha = 0.2"
ggplot(data = alaska_flights, mapping = aes(x = dep_delay, y = arr_delay)) +
geom_point(alpha = 0.2)
```
The key feature to note in @fig-alpha is that the transparency of the points is cumulative: areas with a high degree of overplotting are darker, whereas areas with a lower degree are less dark. Note furthermore that there is no `aes()` surrounding `alpha = 0.2`. This is because we are not mapping a variable to an aesthetic attribute, but rather merely changing the default setting of `alpha`. If you instead changed the second line above to read `geom_point(aes(alpha = 0.2))`, `ggplot2` would treat `0.2` as a data value to be mapped rather than as a fixed setting: you would typically get an unneeded legend and not the transparency you intended.
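To make the distinction between *setting* an aesthetic and *mapping* a variable to it concrete, the sketch below builds two plots. In the first, `alpha` is set to a fixed value outside `aes()`; in the second, the variable `air_time` is mapped to `alpha` inside `aes()`. The choice of `air_time` and the object names `alpha_set` and `alpha_mapped` are ours, purely for illustration.

```{r}
library(nycflights13)
library(dplyr)
library(ggplot2)

alaska_flights <- flights %>%
  filter(carrier == "AS")

# Setting: every point gets the same fixed transparency, and no legend is drawn
alpha_set <- ggplot(data = alaska_flights, mapping = aes(x = dep_delay, y = arr_delay)) +
  geom_point(alpha = 0.2, na.rm = TRUE)

# Mapping: transparency varies with a variable in the data, and a legend is drawn
alpha_mapped <- ggplot(data = alaska_flights, mapping = aes(x = dep_delay, y = arr_delay)) +
  geom_point(mapping = aes(alpha = air_time), na.rm = TRUE)

alpha_set
alpha_mapped
```

For addressing overplotting, the first form is the one we want: the transparency carries no information about any variable, so it belongs outside `aes()`.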
**Method 2: Jittering the points**
The second way of addressing overplotting is by *jittering* all the points, in other words, giving each point a small nudge in a random direction. You can think of "jittering" as shaking the points around a bit on the plot. Let's illustrate using a simple example first. Say we have a data frame `jitter_example` with 4 rows of identical value 0 for both `x` and `y`:
```{r}
#| label: jitter-example-df
#| echo: false
jitter_example <- tibble(
x = c(0, 0, 0, 0),
y = c(0, 0, 0, 0)
)
jitter_example
```
We display the resulting scatterplot in @fig-jitter-example-plot-1; observe that the 4 points are superimposed on top of each other. While we know there are 4 values being plotted, this fact might not be apparent to others.
```{r}
#| label: fig-jitter-example-plot-1
#| fig-cap: "Regular scatterplot of jitter example data"
#| echo: false
ggplot(data = jitter_example, mapping = aes(x = x, y = y)) +
geom_point() +
coord_cartesian(xlim = c(-0.025, 0.025), ylim = c(-0.025, 0.025)) +
labs(title = "Regular scatterplot")
```
In @fig-jitter-example-plot-2 we instead display a *jittered scatterplot* where each point is given a random "nudge." It is now plainly evident that this plot involves four points. Keep in mind that jittering is strictly a visualization tool; even after creating a jittered scatterplot, the original values saved in `jitter_example` remain unchanged.
```{r}
#| label: fig-jitter-example-plot-2
#| fig-cap: "Jittered scatterplot of jitter example data"
#| echo: false
ggplot(data = jitter_example, mapping = aes(x = x, y = y)) +
geom_jitter(width = 0.01, height = 0.01) +
coord_cartesian(xlim = c(-0.025, 0.025), ylim = c(-0.025, 0.025)) +
labs(title = "Jittered scatterplot")
```
To create a jittered scatterplot, instead of using `geom_point()`, we use `geom_jitter()`. To specify how much jitter to add, we adjust the `width` and `height` arguments. This corresponds to how hard you'd like to shake the plot in units corresponding to those for both the horizontal and vertical variables (in this case minutes).
```{r}
#| label: fig-jitter
#| fig-cap: "Jittered delay scatterplot"
ggplot(data = alaska_flights, mapping = aes(x = dep_delay, y = arr_delay)) +
geom_jitter(width = 30, height = 30)
```
Observe how the above code is identical to the code that created the scatterplot with overplotting in [Subsection -@sec-geompoint], but with `geom_point()` replaced with `geom_jitter()`.
The resulting plot in @fig-jitter helps us a little bit in getting a sense for the overplotting, but with a relatively large data set like this one (`r nrow(alaska_flights)` flights), it can be argued that changing the transparency of the points by setting `alpha` proved more effective. In terms of how much jitter one should add using the `width` and `height` arguments, it is important to add just enough jitter to break any overlap in points, but not so much that we completely alter the overall pattern in points.
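Because the nudges are random, a jittered plot looks slightly different each time it is drawn. If you need the same picture every time, one option is to pass a seed to `position_jitter()` and use it as the `position` of `geom_point()`. The sketch below does this for the four-point example; the seed value and the object name `jittered_plot` are arbitrary choices of ours. It also confirms that the data themselves are left untouched.

```{r}
library(dplyr)
library(ggplot2)

# Four identical points at (0, 0)
jitter_example <- tibble(
  x = c(0, 0, 0, 0),
  y = c(0, 0, 0, 0)
)

# Jitter with a fixed seed so the nudges are the same on every run
jittered_plot <- ggplot(data = jitter_example, mapping = aes(x = x, y = y)) +
  geom_point(position = position_jitter(width = 0.01, height = 0.01, seed = 76))
jittered_plot

# The values stored in the data frame are still all zero
jitter_example
```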
::: {.callout-tip icon="false" collapse="true"}
## :dart: Learning Check `r paste0(chap, ".", (lc <- lc + 1))`
Why is setting the `alpha` argument value useful with scatterplots? What further information does it give you that a regular scatterplot cannot?
:::
::: {.callout-tip icon="false" collapse="true"}
## :dart: Learning Check `r paste0(chap, ".", (lc <- lc + 1))`
After viewing @fig-alpha above, give an approximate range of arrival delays and departure delays that occur the most frequently. How has that region changed compared to when you observed the same plot without the `alpha = 0.2` set in @fig-noalpha?
:::
### Summary
Scatterplots display the relationship between two numerical variables. They are among the most commonly used plots because they can provide an immediate way to see the trend in one variable versus another. However, if you try to create a scatterplot where either one of the two variables is not numerical, you might get strange results. Be careful!
With medium to large data sets, you may need to play around with the different modifications one can make to a scatterplot. This tweaking is often a fun part of data visualization, since you'll have the chance to see different relationships come about as you make subtle changes to your plots.
```{=html}
<!--
2019/1/28 note: Add example here using size or color aesthetic?
-->
```
## 5NG#2: Linegraphs {#sec-linegraphs}
The next of the five named graphs are linegraphs. Linegraphs show the relationship between two numerical variables when the variable on the x-axis, also called the *explanatory* variable, is of a sequential nature; in other words, there is an inherent ordering to the variable. The most common examples of linegraphs have some notion of time on the x-axis: hours, days, weeks, years, etc. Since time is sequential, we connect consecutive observations of the variable on the y-axis with a line. Linegraphs that have some notion of time on the x-axis are also called *time series* plots. Linegraphs should be avoided when there is not a clear sequential ordering to the variable on the x-axis. Let's illustrate linegraphs using another data set in the `nycflights13` package: the `weather` data frame.
Let's get a sense for the `weather` data frame:
- Explore the `weather` data by running `View(weather)`.
- Run `?weather` to bring up the help file.
We can see that there is a variable called `temp` of hourly temperature recordings in Fahrenheit at weather stations near all three airports in New York City: Newark (`origin` code `EWR`), JFK, and La Guardia (`LGA`). Instead of considering hourly temperatures for all days in 2013 for all three airports, however, for simplicity let's consider hourly temperatures only at Newark airport for the first 15 days in January.
Recall in @sec-scatterplots we used the `filter()` function to only choose the subset of rows of `flights` corresponding to Alaska Airlines flights. We similarly use `filter()` here, but by using the `&` operator we only choose the subset of rows of `weather` where
1. the `origin` is `"EWR"` and
2. the `month` is January and
3. the `day` is between `1` and `15`
```{r}
early_january_weather <- weather %>%
filter(origin == "EWR" & month == 1 & day <= 15)
```
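As a quick sanity check on these three conditions, we can inspect which values of `origin`, `month`, and `day` survive the filter. The sketch below also compares the result against an alternative way of writing the same filter, where the conditions are separated by commas instead of `&`; the name `early_january_weather_commas` is our own.

```{r}
library(nycflights13)
library(dplyr)

early_january_weather <- weather %>%
  filter(origin == "EWR" & month == 1 & day <= 15)

# Only one airport and one month should remain, and no day beyond the 15th
unique(early_january_weather$origin)
unique(early_january_weather$month)
range(early_january_weather$day)

# Conditions separated by commas are combined in the same way as with `&`
early_january_weather_commas <- weather %>%
  filter(origin == "EWR", month == 1, day <= 15)
all.equal(early_january_weather, early_january_weather_commas)
```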
::: {.callout-tip icon="false" collapse="true"}
## :dart: Learning Check `r paste0(chap, ".", (lc <- lc + 1))`
Take a look at both the `weather` and `early_january_weather` data frames by running `View(weather)` and `View(early_january_weather)`. In what respect do these data frames differ?
:::
::: {.callout-tip icon="false" collapse="true"}
## :dart: Learning Check `r paste0(chap, ".", (lc <- lc + 1))`
`View()` the `weather` data frame again. Why does the `time_hour` variable uniquely identify the hour of the measurement whereas the `hour` variable does not?
:::
### Linegraphs via geom_line {#sec-geomline}
Let's plot a linegraph of hourly temperatures in `early_january_weather` by using `geom_line()` instead of `geom_point()` like we did for scatterplots:
```{r}
#| label: fig-hourlytemp
#| fig-cap: Hourly Temperature in Newark for January 1-15, 2013
ggplot(data = early_january_weather, mapping = aes(x = time_hour, y = temp)) +
geom_line()
```
Much as with the `ggplot()` code that created the scatterplot of departure and arrival delays for Alaska Airlines flights in @fig-noalpha, let's break down the above code piece-by-piece in terms of the Grammar of Graphics:
- Within the `ggplot()` function call, we specify two of the components of the Grammar of Graphics as arguments:
    1. The `data` frame to be `early_january_weather` by setting `data = early_january_weather`.
    2. The `aes`thetic mapping by setting `aes(x = time_hour, y = temp)`. Specifically:
        - the variable `time_hour` maps to the `x` position aesthetic
        - the variable `temp` maps to the `y` position aesthetic
- We add a layer to the `ggplot()` function call using the `+` sign. The layer in question specifies the third component of the grammar: the `geom`etric object in question. In this case the geometric object is a `line`, set by specifying `geom_line()`.
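Further layers can be added in the same way. As a small sketch, the code below adds a `labs()` layer to the linegraph so that the axes carry more readable labels than the raw variable names. The label text and the object name `temp_linegraph` are our own choices.

```{r}
library(nycflights13)
library(dplyr)
library(ggplot2)

early_january_weather <- weather %>%
  filter(origin == "EWR" & month == 1 & day <= 15)

temp_linegraph <- ggplot(data = early_january_weather, mapping = aes(x = time_hour, y = temp)) +
  geom_line() +                        # the geometric object: a line
  labs(
    x = "Date and hour",
    y = "Temperature (°F)",
    title = "Hourly temperature at Newark, January 1-15, 2013"
  )                                    # axis labels and title

temp_linegraph
```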
::: {.callout-tip icon="false" collapse="true"}
## :dart: Learning Check `r paste0(chap, ".", (lc <- lc + 1))`
Why should linegraphs be avoided when there is not a clear ordering of the horizontal axis?
:::
::: {.callout-tip icon="false" collapse="true"}
## :dart: Learning Check `r paste0(chap, ".", (lc <- lc + 1))`
Why are linegraphs frequently used when time is the explanatory variable on the x-axis?
:::
::: {.callout-tip icon="false" collapse="true"}
## :dart: Learning Check `r paste0(chap, ".", (lc <- lc + 1))`
Plot a time series of a variable other than `temp` for Newark Airport in the first 15 days of January 2013.
:::
### Summary
Linegraphs, just like scatterplots, display the relationship between two numerical variables. However, it is preferred to use linegraphs over scatterplots when the variable on the x-axis (i.e. the explanatory variable) has an inherent ordering, like some notion of time.
## 5NG#3: Histograms {#sec-histograms}
Let's consider the `temp` variable in the `weather` data frame once again, but unlike with the linegraphs in @sec-linegraphs, let's say we don't care about the relationship of temperature to time, but rather we only care about how the values of `temp` *distribute*. In other words:
1. What are the smallest and largest values?
2. What is the "center" value?
3. How do the values spread out?
4. What are frequent and infrequent values?
One way to visualize the *distribution* of the single variable `temp` is to plot its values on a horizontal line, as we do in @fig-temp-on-line:
```{r}
#| label: fig-temp-on-line
#| echo: false
#| fig-height: 0.8
#| fig-cap: "Plot of Hourly Temperature Recordings from NYC in 2013"
ggplot(data = weather, mapping = aes(x = temp, y = factor("A"))) +
geom_point() +
theme(axis.ticks.y = element_blank(),
axis.title.y = element_blank(),
axis.text.y = element_blank())
```
This gives us a general idea of how the values of `temp` distribute: observe that temperatures vary from around `r round(min(weather$temp, na.rm = TRUE), 0)`°F up to `r round(max(weather$temp, na.rm = TRUE), 0)`°F. Furthermore, there appear to be more recorded temperatures between 40°F and 60°F than outside this range. However, because of the high degree of overlap in the points, it's hard to get a sense of exactly how many values are between, say, 50°F and 55°F.
What is commonly produced instead of the above plot is known as a *histogram*. A histogram is a plot that visualizes the *distribution* of a numerical variable as follows:
1. We first cut up the x-axis into a series of *bins*, where each bin represents a range of values.
2. For each bin, we count the number of observations that fall in the range corresponding to that bin.
3. Then for each bin, we draw a bar whose height marks the corresponding count.
Let's drill down on an example of a histogram, shown in @fig-histogramexample.
```{r}
#| label: fig-histogramexample
#| echo: false
#| fig-cap: "Example histogram"
ggplot(data = weather, mapping = aes(x = temp)) +
geom_histogram(binwidth = 10, boundary = 70, color = "white")
```
Observe that there are three bins of equal width between 30°F and 60°F, each 10°F wide: one bin for the 30-40°F range, another for the 40-50°F range, and another for the 50-60°F range. Reading off the bar heights:
1. The bin for the 30-40°F range has a height of around 5000, so around 5000 of the hourly temperature recordings are between 30°F and 40°F.
2. The bin for the 40-50°F range has a height of around 4300, so around 4300 of the hourly temperature recordings are between 40°F and 50°F.
3. The bin for the 50-60°F range has a height of around 3500, so around 3500 of the hourly temperature recordings are between 50°F and 60°F.
The remaining bins all have a similar interpretation.
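The three construction steps can also be mimicked by hand. The following sketch cuts `temp` into 10°F-wide bins with base R's `cut()` and counts the recordings in each bin with `table()`. The object names are ours, and the break points are derived from the observed range rather than taken from the figure's code, so treat the result as an approximation of the bar heights rather than an exact reproduction.

```{r}
library(nycflights13)

# Drop the missing recordings before binning
temps <- weather$temp[!is.na(weather$temp)]

# Step 1: cut the axis into 10-degree bins covering the observed range
bin_breaks <- seq(floor(min(temps) / 10) * 10,
                  ceiling(max(temps) / 10) * 10,
                  by = 10)

# Step 2: count the observations that fall in each bin
bin_counts <- table(cut(temps, breaks = bin_breaks, include.lowest = TRUE))

# Step 3: each count is the height of the corresponding bar
bin_counts
```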
### Histograms via geom_histogram {#sec-geomhistogram}
Let's now present the `ggplot()` code to plot your first histogram! Unlike with scatterplots and linegraphs, there is now only one variable being mapped in `aes()`: the single numerical variable `temp`. The y-aesthetic of a histogram gets computed for you automatically. Furthermore, the geometric object layer is now a `geom_histogram()`.
```{r}
#| label: fig-weather-histogram
#| warning: true
#| fig-cap: "Histogram of hourly temperatures at three NYC airports"
ggplot(data = weather, mapping = aes(x = temp)) +
geom_histogram()
```
Let's unpack the messages R sent us first. The first message is telling us that the histogram was constructed using `bins = 30`, in other words 30 equally spaced bins. This is known in computer programming as a default value; unless you override this default number of bins with a number you specify, R will choose 30 by default. We'll see in the next section how to change this default number of bins. The second message is telling us something similar to the warning message we received when we ran the code to create a scatterplot of departure and arrival delays for Alaska Airlines flights in @fig-noalpha: that because one row has a missing `NA` value for `temp`, it was omitted from the histogram. R is just giving us a friendly heads up that this was the case.
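To check the second message yourself, one option is to count the missing values of `temp` directly. This is only a quick sketch using base R's `is.na()`; the name `n_missing_temp` is ours.

```{r}
library(nycflights13)

# Number of rows whose temp is missing; such rows cannot be placed in any bin
n_missing_temp <- sum(is.na(weather$temp))
n_missing_temp
```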
Now let's unpack the resulting histogram in @fig-weather-histogram. Observe that values less than 25°F as well as values above 80°F are rather rare. However, because of the large number of bins, it's hard to get a sense for which range of temperatures is covered by each bin; everything is one giant amorphous blob. So let's add white vertical borders demarcating the bins by adding a `color = "white"` argument to `geom_histogram()`:
```{r}
#| label: fig-weather-histogram-2
#| message: false
#| fig-cap: "Histogram of hourly temperatures at three NYC airports with white borders"
ggplot(data = weather, mapping = aes(x = temp)) +
geom_histogram(color = "white")
```
We can now better associate ranges of temperatures to each of the bins. We can also vary the color of the bars by setting the `fill` argument. Run `colors()` to see all `r colors() %>% length()` possible choices of colors!
```{r}
#| label: fig-weather-histogram-3
#| message: false
#| fig-cap: "Histogram of hourly temperatures at three NYC airports with white borders and steelblue fill"
ggplot(data = weather, mapping = aes(x = temp)) +
geom_histogram(color = "white", fill = "steelblue")
```
### Adjusting the bins {#sec-adjustbins}
Observe in both @fig-weather-histogram-2 and @fig-weather-histogram-3 that in the 50-75°F range there appear to be roughly 8 bins. Thus each bin has a width of 25 divided by 8, or roughly 3.1°F, which is not a very easily interpretable range to work with. Let's now adjust the bins in our histogram using one of two methods:
1. By adjusting the number of bins via the `bins` argument to `geom_histogram()`.
2. By adjusting the width of the bins via the `binwidth` argument to `geom_histogram()`.
Using the first method, we have the power to specify how many bins we would like to cut the x-axis up into. As mentioned in the previous section, the default number of bins is 30. We can override this default, to say 40 bins, as follows:
```{r}
#| label: fig-hist-bins40
#| message: false
#| fig-cap: "Histogram with 40 bins"
ggplot(data = weather, mapping = aes(x = temp)) +
geom_histogram(bins = 40, color = "white")
```
Using the second method, instead of specifying the number of bins, we specify the width of the bins by using the `binwidth` argument in the `geom_histogram()` layer. For example, let's set the width of each bin to be 10°F.
```{r}
#| label: fig-hist-binwidth10
#| warning: false
#| message: false
#| fig-cap: "Histogram with binwidth 10"
ggplot(data = weather, mapping = aes(x = temp)) +
geom_histogram(binwidth = 10, color = "white")
```
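One detail worth knowing when you set `binwidth`: in the versions of `ggplot2` we are aware of, `geom_histogram()` by default centers its bins on multiples of the bin width, so with `binwidth = 10` the bins run 25-35°F, 35-45°F, and so on. The `boundary` argument shifts the bins so that their edges fall where you want them. As a sketch, the code below aligns the edges with multiples of 10°F, which matches the specification used for @fig-histogramexample; the object name `p_aligned` is ours.

```{r}
library(nycflights13)
library(ggplot2)

# boundary = 70 places a bin edge at 70°F, so with binwidth = 10
# every edge falls on a multiple of 10°F (30-40, 40-50, ...)
p_aligned <- ggplot(data = weather, mapping = aes(x = temp)) +
  geom_histogram(binwidth = 10, boundary = 70, color = "white")
p_aligned
```

Aligning the edges this way makes statements such as "recordings between 40°F and 50°F" correspond to exactly one bar.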
::: {.callout-tip icon="false" collapse="true"}
## :dart: Learning Check `r paste0(chap, ".", (lc <- lc + 1))`
What does changing the number of bins from 30 to 40 tell us about the distribution of temperatures?
:::
::: {.callout-tip icon="false" collapse="true"}
## :dart: Learning Check `r paste0(chap, ".", (lc <- lc + 1))`
Would you classify the distribution of temperatures as symmetric or skewed?
:::
::: {.callout-tip icon="false" collapse="true"}
## :dart: Learning Check `r paste0(chap, ".", (lc <- lc + 1))`
What would you guess is the "center" value in this distribution? Why did you make that choice?
:::
::: {.callout-tip icon="false" collapse="true"}
## :dart: Learning Check `r paste0(chap, ".", (lc <- lc + 1))`
Is this data spread out greatly from the center or is it close? Why?
:::
### Summary
Histograms, unlike scatterplots and linegraphs, present information on only a single numerical variable. Specifically, they are visualizations of the distribution of the numerical variable in question.
## Facets {#sec-facets}
Before continuing the 5NG, let's briefly introduce a new concept called *faceting*. Faceting is used when we'd like to split a particular visualization of variables by another variable. This will create multiple copies of the same type of plot with matching x and y axes, but whose content will differ.
For example, suppose we were interested in looking at how the histogram of hourly temperature recordings at the three NYC airports we saw in @sec-histograms differed by month. We would "split" this histogram by the 12 possible months in a given year, in other words plot histograms of `temp` for each `month`. We do this by adding a `facet_wrap(~ month)` layer.
```{r}
#| label: fig-facethistogram
#| fig-cap: "Faceted histogram"
ggplot(data = weather, mapping = aes(x = temp)) +
geom_histogram(binwidth = 5, color = "white") +
facet_wrap(~ month)
```
Note the use of the tilde `~` before `month` in `facet_wrap()`. The tilde is required and you'll receive the error `Error in as.quoted(facets) : object 'month' not found` if you don't include it before `month` here. We can also specify the number of rows and columns in the grid by using the `nrow` and `ncol` arguments inside of `facet_wrap()`. For example, say we would like our faceted plot to have 4 rows instead of 3. To do this, add the `nrow = 4` argument to `facet_wrap(~ month)`:
```{r}
#| label: fig-facethistogram2
#| fig-cap: "Faceted histogram with 4 instead of 3 rows"
ggplot(data = weather, mapping = aes(x = temp)) +
geom_histogram(binwidth = 5, color = "white") +
facet_wrap(~ month, nrow = 4)
```
Observe in both @fig-facethistogram and @fig-facethistogram2 that as we might expect in the Northern Hemisphere, temperatures tend to be higher in the summer months, while they tend to be lower in the winter.
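As an aside, recent versions of `ggplot2` also let you name the faceting variable with `vars()` instead of the tilde formula. The sketch below should produce the same grid as @fig-facethistogram2; treat it as an alternative spelling rather than a replacement for the formula used in this chapter. The object name `p_facet_vars` is ours.

```{r}
library(nycflights13)
library(ggplot2)

# vars(month) plays the same role as ~ month
p_facet_vars <- ggplot(data = weather, mapping = aes(x = temp)) +
  geom_histogram(binwidth = 5, color = "white") +
  facet_wrap(vars(month), nrow = 4)
p_facet_vars
```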
::: {.callout-tip icon="false" collapse="true"}
## :dart: Learning Check `r paste0(chap, ".", (lc <- lc + 1))`
What other things do you notice about the faceted plot above? How does a faceted plot help us see relationships between two variables?
:::
::: {.callout-tip icon="false" collapse="true"}
## :dart: Learning Check `r paste0(chap, ".", (lc <- lc + 1))`
What do the numbers 1-12 correspond to in the plot above? What about 25, 50, 75, 100?
:::
::: {.callout-tip icon="false" collapse="true"}
## :dart: Learning Check `r paste0(chap, ".", (lc <- lc + 1))`
For which types of data sets would these types of faceted plots not work well in comparing relationships between variables? Give an example describing the nature of these variables and other important characteristics.
:::
::: {.callout-tip icon="false" collapse="true"}
## :dart: Learning Check `r paste0(chap, ".", (lc <- lc + 1))`
Does the `temp` variable in the `weather` data set have a lot of variability? Why do you say that?
:::
## 5NG#4: Boxplots {#sec-boxplots}
While faceted histograms are one visualization that allows us to compare distributions of a numerical variable split by another variable, another visualization that achieves this same goal is a *side-by-side boxplot*. A boxplot is constructed from the information provided in the *five-number summary* of a numerical variable (see [Appendix -@sec-stat-background]). To keep things simple for now, let's only consider hourly temperature recordings for the month of November in @fig-nov1.
```{r}
#| label: fig-nov1
#| echo: false
#| fig-cap: "November temperatures"
#| fig-height: 3.7
n_nov <- weather %>%
filter(month == 11) %>%
nrow()
min_nov <- weather %>%
filter(month == 11) %>%
pull(temp) %>%
min(na.rm = TRUE)
max_nov <- weather %>%
filter(month == 11) %>%
pull(temp) %>%
max(na.rm = TRUE)
quartiles <- weather %>%
filter(month == 11) %>%
pull(temp) %>%
quantile(probs = c(0.25, 0.5, 0.75), na.rm = TRUE)
weather %>%
filter(month %in% c(11)) %>%
ggplot(mapping = aes(x = factor(month), y = temp)) +
#geom_boxplot() +
geom_jitter(width = 0.05, height = 0.5, alpha = 0.1) +
labs(x = NULL)
```
These `r n_nov` observations have the following five-number summary:
1. Minimum: `r min_nov`°F
2. First quartile AKA 25^th^ percentile: `r quartiles[1]`°F
3. Median AKA second quartile AKA 50^th^ percentile: `r quartiles[2]`°F
4. Third quartile AKA 75^th^ percentile: `r quartiles[3]`°F
5. Maximum: `r max_nov`°F
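These five values can be computed directly. Below is a minimal sketch using `filter()` and `pull()` from `dplyr` together with base R's `quantile()`; the names `nov_temps` and `nov_five_number` are ours.

```{r}
library(nycflights13)
library(dplyr)

# Hourly temperature recordings for November only
nov_temps <- weather %>%
  filter(month == 11) %>%
  pull(temp)

# Minimum, first quartile, median, third quartile, maximum
nov_five_number <- quantile(nov_temps,
                            probs = c(0, 0.25, 0.5, 0.75, 1),
                            na.rm = TRUE)
nov_five_number
```

Setting `na.rm = TRUE` keeps any missing recordings from turning the whole summary into an error.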
Let's mark these 5 values with dashed horizontal lines in @fig-nov2.
```{r}
#| label: fig-nov2
#| echo: false
#| fig-cap: "November temperatures"
#| fig-height: 3.7
five_number <- tibble(
temp = c(min_nov, quartiles, max_nov)
)
weather %>%
filter(month %in% c(11)) %>%
ggplot(mapping = aes(x = factor(month), y = temp)) +
#geom_boxplot() +
geom_hline(data = five_number, aes(yintercept=temp), linetype = "dashed") +
geom_jitter(width = 0.05, height = 0.5, alpha = 0.1) +
labs(x = NULL)
```
Let's add the boxplot underneath these points and dashed horizontal lines in @fig-nov3.
```{r}
#| label: fig-nov3
#| echo: false
#| fig-cap: "November temperatures"
#| fig-height: 3.7
weather %>%
filter(month %in% c(11)) %>%
ggplot(mapping = aes(x = factor(month), y = temp)) +
geom_boxplot() +
geom_hline(data = five_number, aes(yintercept=temp), linetype = "dashed") +
geom_jitter(width = 0.05, height = 0.5, alpha = 0.1) +
labs(x = NULL)
```
What the boxplot does is summarize the `r weather %>% filter(month == 11) %>% nrow()` points by emphasizing that:
1. 25% of points (about 534 observations) fall below the bottom edge of the box, which is the first quartile of `r quartiles[1] %>% round(3)`°F. In other words 25% of observations were colder than `r quartiles[1] %>% round(3)`°F.
2. 25% of points fall between the bottom edge of the box and the solid middle line, which is the median of `r quartiles[2] %>% round(3)`°F. In other words 25% of observations were between `r quartiles[1] %>% round(3)` and `r quartiles[2] %>% round(3)`°F and 50% of observations were colder than `r quartiles[2] %>% round(3)`°F.
3. 25% of points fall between the solid middle line and the top edge of the box, which is the third quartile of `r quartiles[3] %>% round(3)`°F. In other words 25% of observations were between `r quartiles[2] %>% round(3)` and `r quartiles[3] %>% round(3)`°F and 75% of observations were colder than `r quartiles[3] %>% round(3)`°F.
4. 25% of points fall over the top edge of the box. In other words 25% of observations were warmer than `r quartiles[3] %>% round(3)`°F.
5. The middle 50% of points lie within the *interquartile range* between the first and third quartile of `r quartiles[3] %>% round(3)` - `r quartiles[1] %>% round(3)` = `r (quartiles[3] - quartiles[1]) %>% round(3)`°F.
Lastly, for clarity's sake let's remove the points but keep the dashed horizontal lines in @fig-nov4.
```{r}
#| label: fig-nov4
#| echo: false
#| fig-cap: "November temperatures"
#| fig-height: 3.7
weather %>%
filter(month %in% c(11)) %>%
ggplot(mapping = aes(x = factor(month), y = temp)) +
geom_boxplot() +
geom_hline(data = five_number, aes(yintercept=temp), linetype = "dashed") +
# geom_jitter(width = 0.05, height = 0.5, alpha = 0.1) +
labs(x = NULL)
```
We can now better see the *whiskers* of the boxplot. They stick out from either end of the box all the way to the minimum and maximum observed temperatures of `r min_nov`°F and `r max_nov`°F respectively. However, the whiskers don't always extend to the smallest and largest observed values. They in fact can extend no more than 1.5 $\times$ the interquartile range from either end of the box, in this case 1.5 $\times$ `r (quartiles[3] - quartiles[1]) %>% round(3)`°F = `r (1.5*(quartiles[3] - quartiles[1])) %>% round(3)`°F from either end of the box. Any observed values outside these whiskers get marked with points called *outliers*, which we'll see in the next section.
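The whisker rule can be spelled out numerically. The sketch below recomputes the November quartiles, derives the two limits that lie 1.5 $\times$ IQR beyond the box, and then finds where the whiskers actually end, namely at the most extreme recordings inside those limits. All object names are ours, and the quartiles come from base R's `quantile()` with its default settings.

```{r}
library(nycflights13)
library(dplyr)

# November recordings with missing values dropped
nov_temps <- weather %>%
  filter(month == 11, !is.na(temp)) %>%
  pull(temp)

nov_q1 <- quantile(nov_temps, probs = 0.25, names = FALSE)
nov_q3 <- quantile(nov_temps, probs = 0.75, names = FALSE)
nov_iqr <- nov_q3 - nov_q1

# Farthest the whiskers are allowed to reach
lower_limit <- nov_q1 - 1.5 * nov_iqr
upper_limit <- nov_q3 + 1.5 * nov_iqr

# Whiskers stop at the most extreme observations within the limits
whisker_low <- min(nov_temps[nov_temps >= lower_limit])
whisker_high <- max(nov_temps[nov_temps <= upper_limit])

# Anything beyond the limits would be drawn as an outlier
n_outliers <- sum(nov_temps < lower_limit | nov_temps > upper_limit)

c(whisker_low = whisker_low, whisker_high = whisker_high, n_outliers = n_outliers)
```

Since the whiskers in @fig-nov4 reach the minimum and maximum, we would expect `n_outliers` to be 0 for November.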
### Boxplots via geom_boxplot {#sec-geomboxplot}
Let's now create a side-by-side boxplot of hourly temperatures split by the 12 months as we did above with the faceted histograms. We do this by mapping the `month` variable to the x-position aesthetic, the `temp` variable to the y-position aesthetic, and by adding a `geom_boxplot()` layer:
```{r}
#| label: fig-badbox
#| fig-cap: "Invalid boxplot specification"
#| fig-height: 3.5
ggplot(data = weather, mapping = aes(x = month, y = temp)) +
geom_boxplot()
```
Warning messages:
1: Continuous x aesthetic -- did you forget aes(group=...)?
2: Removed 1 rows containing non-finite values (stat_boxplot).
Observe in @fig-badbox that this plot does not provide information about temperature separated by month. The warning messages clue us in as to why. The second warning message is identical to the warning message when plotting a histogram of hourly temperatures: one of the values was recorded as `NA`, i.e. missing. The first warning message, however, is telling us that we have a "continuous", or numerical, variable on the x-position aesthetic. Side-by-side boxplots, however, require a categorical variable on the x-axis.
We can convert the numerical variable `month` into a categorical variable by using the `factor()` function. So after applying `factor(month)`, month goes from having numerical values 1, 2, ..., 12 to having labels "1", "2", ..., "12".
```{r}
#| label: fig-monthtempbox
#| fig-cap: Temp by month boxplot
#| fig-height: 3.7
ggplot(data = weather, mapping = aes(x = factor(month), y = temp)) +
geom_boxplot()
```
The resulting @fig-monthtempbox shows 12 separate "box and whiskers" plots with the features we saw earlier focusing only on November:
- The "box" portions of this visualization represent the 1^st^ quartile, the median AKA the 2^nd^ quartile, and the 3^rd^ quartile.
- The "length" of each box, i.e. the value of the 3^rd^ quartile minus the value of the 1^st^ quartile, is the *interquartile range*. It is a measure of spread of the middle 50% of values, with longer boxes indicating more variability.
- The "whisker" portions of these plots extend out from the bottoms and tops of the boxes and represent points less than the 25^th^ percentile and greater than the 75^th^ percentile respectively. They're set to extend out no more than $1.5 \times IQR$ units away from either end of the boxes. We say "no more than" because the ends of the whiskers have to correspond to observed temperatures. The length of these whiskers shows how the data outside the middle 50% of values vary, with longer whiskers indicating more variability.
- The dots representing values falling outside the whiskers are called *outliers*. These can be thought of as anomalous values.
It is important to keep in mind that the definition of an outlier is somewhat arbitrary and not absolute. In this case, they are defined by the length of the whiskers, which are no more than $1.5 \times IQR$ units long. Looking at this plot we can see, as expected, that summer months (6 through 8) have higher median temperatures as evidenced by the higher solid lines in the middle of the boxes. We can easily compare temperatures across months by drawing imaginary horizontal lines across the plot. Furthermore, the heights of the 12 boxes, as quantified by the interquartile ranges, are informative too; they tell us about the variability, or spread, of temperatures recorded in a given month.
::: {.callout-tip icon="false" collapse="true"}
## :dart: Learning Check `r paste0(chap, ".", (lc <- lc + 1))`
What does the dot at the bottom of the plot for May correspond to? Explain what might have occurred in May to produce this point.
:::
::: {.callout-tip icon="false" collapse="true"}
## :dart: Learning Check `r paste0(chap, ".", (lc <- lc + 1))`
Which months have the highest variability in temperature? What reasons can you give for this?
:::
::: {.callout-tip icon="false" collapse="true"}
## :dart: Learning Check `r paste0(chap, ".", (lc <- lc + 1))`
We looked at the distribution of the numerical variable `temp` split by the numerical variable `month` that we converted to a categorical variable using the `factor()` function. Why would a boxplot of `temp` split by the numerical variable `pressure` similarly converted to a categorical variable using the `factor()` not be informative?
:::
::: {.callout-tip icon="false" collapse="true"}
## :dart: Learning Check `r paste0(chap, ".", (lc <- lc + 1))`
Boxplots provide a simple way to identify outliers. Why may outliers be easier to identify when looking at a boxplot instead of a faceted histogram?
:::
### Summary
Side-by-side boxplots provide us with a way to compare and contrast the distribution of a numerical variable across the levels of a categorical variable. One can see where the median falls across the different groups by looking at the center line in the boxes. To see how spread out the variable is across the different groups, look at both the length of the box and how far the whiskers stretch out away from the box. Outliers are also more easily identified in a boxplot than in a histogram, as they are marked with points.
## 5NG#5: Barplots {#sec-geombar}
Both histograms and boxplots are tools to visualize the distribution of numerical variables. Another common task is to visualize the distribution of a categorical variable. This is a simpler task, as we are simply counting different categories, also known as *levels*, of a categorical variable. Often the best way to visualize these different counts, also known as *frequencies*, is with a barplot (also known as a barchart). One complication, however, is how your data is represented: is the categorical variable of interest "pre-counted" or not? For example, run the following code that manually creates two data frames representing a collection of fruit: 3 apples and 2 oranges.
```{r}
fruits <- tibble(
fruit = c("apple", "apple", "orange", "apple", "orange")
)
fruits_counted <- tibble(
fruit = c("apple", "orange"),
number = c(3, 2)
)
```
We see both the `fruits` and `fruits_counted` data frames represent the same collection of fruit. Whereas `fruits` just lists the fruit individually...
```{r}
#| label: fruits
#| echo: false
fruits
```
... `fruits_counted` has a variable `number` which represents pre-counted values of each fruit.
```{r}
#| label: fruitscounted
#| echo: false
fruits_counted
```
Depending on how your categorical data is represented, you'll need to add a different `geom` layer to your `ggplot()` to create a barplot, as we now explore.
### Barplots via geom_bar or geom_col
Let's generate barplots using these two different representations of the same basket of fruit: 3 apples and 2 oranges. Using the `fruits` data frame where all 5 fruits are listed individually in 5 rows, we map the `fruit` variable to the x-position aesthetic and add a `geom_bar()` layer.
```{r}
#| label: fig-geombar
#| fig-cap: "Barplot when counts are not pre-counted"
#| fig-height: 2.5
ggplot(data = fruits, mapping = aes(x = fruit)) +
geom_bar()
```
However, using the `fruits_counted` data frame where the fruit have been "pre-counted", we map the `fruit` variable to the x-position aesthetic as with `geom_bar()`, but we also map the `number` variable to the y-position aesthetic, and add a `geom_col()` layer.
```{r}
#| label: fig-geomcol
#| fig-cap: "Barplot when counts are pre-counted"
#| fig-height: 2.5
ggplot(data = fruits_counted, mapping = aes(x = fruit, y = number)) +
geom_col()
```
Compare the barplots in @fig-geombar and @fig-geomcol. They are identical because they reflect counts of the same 5 fruits. However, depending on how our data is saved, either pre-counted or not, we must add a different `geom` layer. When the categorical variable whose distribution you want to visualize:
- Is not pre-counted in your data frame, use `geom_bar()`.
- Is pre-counted in your data frame, use `geom_col()` with the y-position aesthetic mapped to the variable that has the counts.
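The two representations are one counting step apart. As a sketch, `count()` from `dplyr` turns the un-counted `fruits` data frame into a pre-counted one; `fruits_recounted` is a name we made up, and `name = "number"` simply labels the count column to match `fruits_counted`.

```{r}
library(dplyr)
library(ggplot2)

fruits <- tibble(
  fruit = c("apple", "apple", "orange", "apple", "orange")
)

# One row per type of fruit, with the number of rows of that type
fruits_recounted <- fruits %>%
  count(fruit, name = "number")

# The data is now pre-counted, so use geom_col() with y mapped to the counts
ggplot(data = fruits_recounted, mapping = aes(x = fruit, y = number)) +
  geom_col()
```

In effect, `geom_bar()` performs this counting step for you before drawing the bars, whereas `geom_col()` expects it to have been done already.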
Let's now go back to the `flights` data frame in the `nycflights13` package and visualize the distribution of the categorical variable `carrier`. In other words, let's visualize the number of domestic flights out of the three New York City airports each airline company flew in 2013. Recall from @sec-exploredataframes when you first explored the `flights` data frame you saw that each row corresponds to a flight. In other words the `flights` data frame is more like the `fruits` data frame than the `fruits_counted` data frame above, and thus we should use `geom_bar()` instead of `geom_col()` to create a barplot. Much like a `geom_histogram()`, there is only one variable in the `aes()` aesthetic mapping: the variable `carrier` gets mapped to the `x`-position.
```{r}
#| label: fig-flightsbar
#| fig-cap: "Number of flights departing NYC in 2013 by airline using `geom_bar()`"
#| fig-height: 2.5
ggplot(data = flights, mapping = aes(x = carrier)) +
geom_bar()
```
Observe in @fig-flightsbar that United Air Lines (`UA`), JetBlue Airways (`B6`), and ExpressJet Airlines (`EV`) had the most flights depart New York City in 2013. If you don't know which airlines correspond to which carrier codes, then run `View(airlines)` to see a directory of airlines. For example: AA is American Airlines; B6 is JetBlue Airways; DL is Delta Air Lines; EV is ExpressJet Airlines; MQ is Envoy Air; and UA is United Air Lines.
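If you prefer not to scroll through the viewer, a short sketch with `filter()` looks up just the carrier codes of interest in the `airlines` data frame from `nycflights13`; the object name `top_carriers` is ours.

```{r}
library(nycflights13)
library(dplyr)

# Keep only the rows for the three carriers with the most flights
top_carriers <- airlines %>%
  filter(carrier %in% c("UA", "B6", "EV"))
top_carriers
```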
Alternatively, say you had a data frame `flights_table` where the number of flights for each `carrier` was pre-counted like in @tbl-flights-counted.
```{r}
#| label: tbl-flights-counted
#| tbl-cap: "Number of flights pre-counted for each carrier"
#| message: false
#| echo: false
flights_table <- flights %>%
group_by(carrier) %>%
summarize(number = n()) %>%
arrange(desc(number))
kable(
format = "markdown",
flights_table,
digits = 3,
caption = "Number of flights pre-counted for each carrier",
booktabs = TRUE,
longtable = TRUE
) %>%
kable_styling(
font_size = ifelse(knitr::is_latex_output(), 10, 16),
latex_options = c("HOLD_position"),
full_width = FALSE
)
```
In order to create a barplot visualizing the distribution of the categorical variable `carrier` in this case, we would use `geom_col()` instead with `x` mapped to `carrier` and `y` mapped to `number` as seen below. The resulting barplot would be identical to @fig-flightsbar.
```{r}
#| eval: false
ggplot(data = flights_table, mapping = aes(x = carrier, y = number)) +
geom_col()
```
::: {.callout-tip icon="false" collapse="true"}
## :dart: Learning Check `r paste0(chap, ".", (lc <- lc + 1))`
Why are histograms inappropriate for visualizing categorical variables?
:::
::: {.callout-tip icon="false" collapse="true"}
## :dart: Learning Check `r paste0(chap, ".", (lc <- lc + 1))`
What is the difference between histograms and barplots?
:::
::: {.callout-tip icon="false" collapse="true"}