forked from lmullen/dh-r
-
Notifications
You must be signed in to change notification settings - Fork 0
/
data.Rmd
564 lines (393 loc) · 36.1 KB
/
data.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
---
title: Data Manipulation
---
One of the most crucial tasks in digital history is manipulating data. It is no exaggeration to say that in many projects, manipulating and cleaning the data is 80 percent or more of the work, while visualizing or analyzing it is only 20 percent of the work. It might seem like visualizing or analyzing is the real historical work, and it is certainly the part of the work that leads to historical interpretations. But data manipulation is the crucial task for the digital historian, because it is in manipulating the data that we explores its possibilities and its perils: that is, the possibilities that exist for finding interesting historical interpretations, but also the places where we could make errors by trusting the data too much.
There are several things that you can learn by learning data manipulation. Most obviously, you will learn how to get data from whatever form you receive it to a form which is useable for any kind of task, as well as how to take large datasets and summarize them meaningfully. Less obviously, the methods for manipulating data are closely related to some of the basic principles of databases. For example, filtering data is like querying it; merging data is like joining it; tidying data is like normalizing a database. If you can learn the principles behind data manipulation, you will be well prepared for understanding databases.
## Definitions
We need to begin with a few definitions. What is *data*? For our purposes, let us say that data is any source amenable to computation. Most obviously this could be numeric information, such as in the reports of the U.S. Census Bureau. But it could also be a corpus of text files to be text mined, or bibliographic information retrived from an API.
Data comes in different *data structures*. Each of these types of data is likely to have its own format. The Census Bureau information is like to come as a table of numbers; the API information is likely to come as a list (an R data structure like an object in JavaScript, a dictionary in Python, or a hash in Ruby), and the text corpus is likely to be a set of plain text files. This chapter will briefly discuss how to work with lists, though we will leave most of those examples for the chapter on [APIs](apis.html). Working with text files will be discussed in the chapter on [text analysis](topic-modeling.html). This chapter will mostly discuss how to work with data that comes in tables: the form you are most likely to work with.
Tables in R are represented by a data structure called a *data frame*. A data frame is very much like a spreadsheet as represented in R. It contains columns, which have names which are usually the names of variables. It also has rows which contain the observations. A data frame is the type of data structure that you get whenever you load a .csv file into R.
Take for example the data frame of US state populations from the historydata package.
```{r}
library(stringi)
library(dplyr)
library(historydata)
data(us_state_populations)
us_state_populations
```
This data frame has columns of varying types. The `state` column is a character vector containing state names; the `population` column is a numeric vector.
This can be seen when examining the structure of the data frame.
```{r}
str(us_state_populations)
```
We can see that there are 983 "observations" (rows) and 4 "variables" (columns) in the object which has the class `data.frame`. The `$` gives us a hint that we can access the parts of the data frame with that operator. We can, for example, find the median population of all the states in U.S. history:
```{r}
median(us_state_populations$population, na.rm = TRUE)
```
We can think of data frames as coming in two forms. One is a *wide* data frame, in which there are many columns of variables. Take for instance this data set from the [NHGIS](http://nhgis.org) which contains information about the birthplace of citizens of each state in the United States in 1850.
```{r}
birthplace_1850 <- read.csv("data/nhgis-nativity/nhgis0023_ds12_1850_state.csv",
stringsAsFactors = FALSE)
str(birthplace_1850)
```
This dataset contains only `r nrow(birthplace_1850)` rows, one for each state or territory in the United States. But it contains `r ncol(birthplace_1850)` columns. Some of these contain the `YEAR` and `STATE` information. But most of these columns have cryptic names such as `AFA001` and `AFB057`. Looking at an excerpt from the codebook that NHGIS provides with the dataset, we can see what these columns mean.
```
Table 2: Place of Birth
Universe: Persons
Source code: NT12
NHGIS code: AFB
AFB001: Native-born: Alabama
AFB002: Native-born: Arkansas
AFB003: Native-born: California
...
AFB034: Foreign-born: England
AFB035: Foreign-born: Scotland
AFB036: Foreign-born: Wales
AFB037: Foreign-born: Ireland
```
The name of each column in the dataset is itself a variable. The column name represents where someone was born. Actually, it represents two pieces of information: where someone was born, and whether that birth place was inside the United States or outside of it. This type of wide dataset is very difficult to work with for most purposes. For example, it would require a complex set of commands to find the proportion of native- and foreign-born people in each state. We have a similar table for other years, but it is impossible to easily join the two data frames together. It is also difficult to replace the codes with the names that they represent.
A much better solution is to use a *long* (or *narrow*) data format. Fortunately R has a command to reshape the data frame from wide to long. Consider the same data frame in long format. Don't worry for now about the commands that do the transformation (but notice that we can do this transformation with a single function). Instead just look at the final product.
```{r}
library(tidyr)
birthplace_1850_long <- birthplace_1850 %>%
gather(birthplace, count, -GISJOIN, -YEAR, -STATE, -STATEA, -AREANAME)
head(birthplace_1850_long)
str(birthplace_1850_long)
```
Notice that now we have `r nrow(birthplace_1850_long)` rows but only `r ncol(birthplace_1850_long)` columns. All of the columns whose names were codes are now gathered into the `birthplace` and `count` columns. The `birthplace` column contains the codes (which, as you remember, actually encode information about the birthplace) and the `count` column contains the number of people who had that particular birthplace for each combination of state and year.
This long dataset is far more useful. We could much more easily replace the hard-to-read codes with the birthplaces that they represent. We could also do the same transformation on datasets for other years, and then append them to one another, creating a new timeseries.
This way of structuring data has been called *tidy data* by Hadley Wickham.^[Hadley Wickham, "Tidy Data," Journal of Statistical Software 50, no. 10 (2014): <http://www.jstatsoft.org/v59/i10>.] We can use his definition of tidy data from the article in which he explains the concept:
> In tidy data:
> 1. Each variable forms a column.
> 2. Each observation forms a row.
> 3. Each type of observational unit forms a table.^[Wickham, "Tidy Data," 4.]
We will explore this definition by applying it to our dataset later in this chapter. In this case we have applied principles 1 and 2. For us an observation take the form, "*n* people from birthplace *x* lived in state *y* in year *z*." Our wide version of the birthplace data was messy because the columns contained information (the birthplace), not just the names of the variables. By reshaping the data, we have moved all of the information in the table into a place where it can be easily manipulated. The tidy data format is not superior to the wide data format for every purpose: for presentation purposes it is often much better to have data in a wide format, and sometimes it is easier to perform certain calculations in that format. The R package [tidyr](http://cran.rstudio.org/web/packages/tidyr/) (again, written by Wickham) also provides a function to go from narrow to wide data. But for most data manipulation the goal is to achieve a tidy data format, from which it is far easier to do other data manipulations to gather information.
There is one other definition that we need to offer. So far we have been talking about *tidying* data, by which we mean changing its structure into a more useful form. Another task altogether is *cleaning* data. Data often contains inconsistencies or outright errors. For example, in our data frame of birthplaces, it is entirely possible that the names of states could be inconsistent: `"MA"`, `"Massachusetts"`, and `"Mass."` and even `"MA "` and `"Massachusetts "` (note the extra spaces) might convey the same meaning to you, but they are decidedly not the same thing to your computer. It is common for dates to contain errors: I recently cleaned some data where an event started in `1880` and ended in `1800`. Any human entered data will be rife with such errors; any data that is OCR'ed is likely to be worse. The methods of data manipulation, including data tidying, are generalizable to most datasets. The problems of data cleaning are particular to each dataset. Only by working closely with a dataset will you be able to identify the problems and find an appropriate way to solve them. This chapter will suggest some basic strategies for data cleaning, but on the whole it will focus on data tidying and manipulation.
This chapter will begin by identifying the common verbs of a grammar of data manipulation provided by two packages, [dplyr](http://cran.rstudio.org/web/packages/dplyr/) and [tidyr](http://cran.rstudio.org/web/packages/tidyr/), using a small sample dataset. Then it will take your through the process of combining those verbs to turn the NHGIS dataset on birthplaces into a usable dataset.
## The grammar of data manipulation
When we manipulate data, it helps to think of what we are doing as a set of verbs. Since almost everything in R is a function, those verbs will be expressed as functions. This concept of data manipulation as a grammar, as well as the actual R functions which make it possible, are drawn from Hadley Wickham's work.^[The package dplyr is also written by Romain Francois.]
There are essentially eight verbs for data manipulation. These eight verbs are provided by two R packages, [dplyr](http://cran.rstudio.org/web/packages/dplyr/) and [tidyr](http://cran.rstudio.org/web/packages/tidyr/). While these manipulations can also be performed in base R, they are far more easily explained, and even faster to compute, using these two packages.
We'll begin our exploration of the grammar of data manipulation by creating a toy dataset small enough that we can perform all of the calculations manually to understand what is going on. We will use a made up example of church memberships.
```{r}
set.seed(337)
sample_df <- data.frame(
name = toupper(letters[1:10]),
denomination = sample(c("Presbyterian", "Episcopalian",
"Catholic", "Baptist"), 10, replace = TRUE),
city = sample(c("New York", "Boston", "Baltimore"), 10, replace = TRUE),
members = sample(seq(1e2, 1e3, 10), 10, replace = TRUE),
attendees = sample(seq(1e2, 1e4, 10), 10, replace = TRUE),
stringsAsFactors = FALSE)
sample_df
```
Now that we have a data frame with sample data, we can use it to try out the data manipulation verbs. We also will load our two necessary libraries:
```{r}
library(dplyr)
library(tidyr)
```
### The pipe (%>%) function
First we have to learn about the pipe operator, `%>%`. The pipe operator is not one of the data manipulation verbs, but it will make our task of data manipulation far easier.^[The pipe operator is provided by the [magrittr](http://cran.rstudio.org/web/packages/magrittr/) by Stefan Bache. It is not necessary to load magrittr directly if you are loading dplyr, since dply imports the pipe operator. The pipe operator is, like everything in R a function and not technically an operator. But it is a special type of binary function in R which takes its arguments as a left-hand side and a right-hand side.]
The pipe operator is like the pipe operator, `|`, in Unix: it passes the results of one command to the input of the next. You can think of it as the word "then" for R: do this *then* do that. What the pipe operator does is take the output of one function and pass it as the first argument in the next function. Consider the following example.
```{r}
head(sample_df)
sample_df %>% head()
```
Notice that the two examples are equivalent. The pipe operator lets us call a function on a dataset by writing it after instead of around the variable name. Even more important, it lets us pass the dataset through a chain of operators. We can do this without saving any intermerdiate state. Consider the following more complicated example to find out how many unique denominations are in our dataset:
```{r}
length(unique(sample_df$denomination))
sample_df$denomination %>%
unique() %>%
length()
```
The top example is nearly unreadable, and that is with only two function calls and a variable name. The bottom example is far more readable: "Take the list of denominations, *then* find the unique names, *then* find the number of unique names." The ability to chain operators lets us string together our data manipulation verbs to perform complex actions.^[For more, see the [magrittr vignette](http://cran.r-project.org/web/packages/magrittr/vignettes/magrittr.html)].
### select() (dplyr)
Our first data manipulation verb is *select*. The `select()` operator lets us pick which columns we want from a data frame. We can do this by specifying the names of the columns that we wish to keep:
```{r}
sample_df %>%
select(name, city, members)
```
It is also possible to specify the columns we do not want and keep all the others.
```{r}
sample_df %>%
select(-denomination, -attendees)
```
You can also specify a range of columns using the `:` symbol.
```{r}
sample_df %>%
select(name, city:attendees)
```
### filter() (dplyr)
The verb to select only certain rows from a column is *filter*. The `filter()` function take a conditional expression using one or more of the columns in the data frame. A conditional checks for the truth of an expression and returns a boolean value, `TRUE` or `FALSE`. All of the following are examples of conditionals.
```{r}
100 > 0
4 <= 0
"string" == "another string"
is.na("a value")
is.na(NA)
nchar("string") == 6
```
You can use the `filter()` function to return only the rows that match a particular conditional or set of conditionals. For example, we could return only the Catholic churches in our sample data frame.
```{r}
sample_df %>%
filter(denomination == "Catholic")
```
Alternatively, we could find only the churches with at least 500 members.
```{r}
sample_df %>%
filter(members >= 500)
```
Or we could use both conditions. This is the equivalent of joining the two conditions with the `&` operator.
```{r}
sample_df %>%
filter(denomination == "Catholic",
members >= 500)
```
### arrange() (dplyr)
The verb *arrange* and its corresponding `arrange() function lets us sort a data frame. We might want to sort our sample list of churches by the number of attendees:
```{r}
sample_df %>%
arrange(attendees)
```
Notice that the churches are sorted in ascending order of attendees. If we want to sort them from greatest to least, we can use the `desc()` function.
```{r}
sample_df %>%
arrange(desc(attendees))
```
Sorting will work with character vectors as well as with numeric vectors.
### mutate() (dplyr)
*Mutate* and the `mutate()` function is the verb that lets us create new columns from existing columns.^[You can think of this, if you must, as being like a formula in Excel.] For example, in our churches dataset, we might like to know the proportion of attendees that are members. We can use the `=` operator to assign a calculation to a new column named whatever we like.
```{r}
sample_df %>%
mutate(membership_proportion = members / attendees)
```
You can use use the `mutate()` function for all kinds of operations, not just numeric calculations. For example, we could create a column, `protestant`, which has the value `TRUE` if the denomination field is not `Catholic`, and `FALSE` otherwise.
```{r}
sample_df %>%
mutate(protestant = ifelse(denomination == "Catholic", FALSE, TRUE))
```
### summarize() and group_by() (dplyr)
One of the most powerful data manipulation verbs is *summarize*, `summarize(). It is often the case that our dataset contains many individual observations which we want to collapse into a summary table. In this pattern, we take our data frame and split it apart into several data frames by grouping together unique variables or combinations of variables. We then summarize each data frame using some function, and combine them back together.
This concept is difficult to wrap our minds around without an example. Say, for example, that we wanted to know the total number of each denomination from our data frame. We have four different denominations represented. We could group all the Episcopalians together, all the Baptists together, and so on. Then we could count how many are in each group (using the function `n()`) and combine that information into a new data frame. The three pieces here are the variable (or combination of variables) to group by, the function used to summarize them, and the name of the new column to represent the summed up value.
```{r}
sample_df %>%
group_by(denomination) %>%
summarize(number_of_churches = n())
```
We could perform the same essential operation by count the number of churches in each city.
```{r}
sample_df %>%
group_by(city) %>%
summarize(number_of_churches = n())
```
Or we could summarize the churches in a more sophisticated way by summing up (using the function `sum()`) the number of attendees and members for each denomination, or by using `mean()` to find the average number of attendees and members.^[Keep in mind that the `sum()` function will return `NA` if the vector passed to it contains any `NA`s. This is because R is (rightly) conservative and won't pretend to assign a value to those missing values. You can avoid this by adding the `na.rm = TRUE` argument to sum. So if you have a vector `x` with `NA` values, then `sum(x, na.rm = TRUE)` will remove the missing values and sum the rest. The same argument also applies to `mean()` and `median()` and quite a few other R functions.]
```{r}
sample_df %>%
group_by(denomination) %>%
summarize(total_members = sum(members),
total_attendees = sum(attendees),
avg_members = mean(members),
avg_attendees = mean(attendees))
```
We could also find the total number of members for each denomination by city by adding a field to the `group_by()` function. This will yield all the unique combinations of `city` and `denomination`.
```{r}
sample_df %>%
group_by(denomination, city) %>%
summarize(total_members = sum(members))
```
When using summarize it is important to keep the three components in mind.
1. The `group_by()` function takes the name of a variable or variables for which it will find all the unique combinations. The number of unique combinations will be the number of rows in our summarized data frame. Those unique combinations will be the basis of several new data frames which will be passed to step 2.
2. The function to the right of the `=` sign inside `summarize()` provides a function which turns multiple rows in our split-apart data frames into a single row. For example, `sum()` and `mean()` both take a vector of numbers and return a single value, while `n()` counts the number of rows in the split-apart data frame.
3. The variable name to the left of the `=` sign inside `summarize()` provides the new column name in the summarized data frame.^[These three steps correspond to the "split-apply-combine" steps in Hadley Wickham's article by that name. Hadley Wickham, "The Split-Apply-Combine Strategy for Data Analysis," *Journal of Statistical Software* 40, no. 1 (2011): <http://www.jstatsoft.org/v40/i01/>.]
### gather() (tidyr)
Our next two data manipulation verbs are drawn from the [tidyr](http://cran.rstudio.org/web/packages/tidyr/) package instead of from the [dplyr](http://cran.rstudio.org/web/packages/dply/) package. Because they involve manipulating data from wide to long formats and back again, we will create a slightly different example data frame. Suppose that our data frame from earlier now contains two columns for the membership in three different years:
```{r}
sample_df_wide <- data.frame(
name = toupper(letters[1:10]),
denomination = sample(c("Presbyterian", "Episcopalian",
"Catholic", "Baptist"), 10, replace = TRUE),
city = sample(c("New York", "Boston", "Baltimore"), 10, replace = TRUE),
members_1830 = sample(seq(1e2, 1e3, 10), 10, replace = TRUE),
members_1840 = sample(seq(1e2, 1e3, 10), 10, replace = TRUE),
members_1850 = sample(seq(1e2, 1e3, 10), 10, replace = TRUE),
stringsAsFactors = FALSE)
sample_df_wide
```
This information would be easier to plot if it were in long format instead of wide format. The `gather()` function from tidyr lets us perform such an operation.
What we want to do in this case is take the three columns `members_1830`, `members_1840`, and `members_1850` and turn them into two columns: `year` and `members`. As they stand now, the column names actually represent data, that is, the year in which an observation was taken, and that information belongs in its own column. The `gather()` function will expect us to give the names of the key and value columns. In this case, that will be `year` (whose values will beome the the existing names of the columns) and `members` (whose values will become the current values of the columns we are gathering). The function `gather()` will also let us specify which columns should remain untouched, in this case, `name`, `denomination`, and `city`. The columns to be spread can be specified with the same syntax as in `select()`.
```{r}
sample_df_wide %>%
gather(year, members, -name, -denomination, -city)
```
Notice that our data frame is now much longer but also narrower. We would have to do some additional cleanup to transform the character vectors of the type `members_1830` to integers of the type `1830`, but we could do that using mutate and functions from the [stringr](http://cran.rstudio.org/web/packages/stringr/) package. (See the chapter on manipulating [strings](strings.html).)
### spread() (tidyr)
*Gather* is the exact opposite of *spread*. Where the `gather()` function took column names and turned them into the values of rows, `spread()` takes the values of rows and turns them into columns. We can see this with our newly transformed long dataset made with `gather()`, which we can retransform using `spread()`. We have to pass spread the names of the key and value columns. Each unique value in the key column will become a new column with that name; each associated value will the value column will be filed under that column.
```{r}
sample_df_long <- sample_df_wide %>%
gather(year, members, -name, -denomination, -city)
sample_df_long %>%
spread(year, members)
```
Sometimes you will want to spread or gather a column to perform calculations on it, and then you will want to do the reverse to get it back to its original form.
### The join family of functions (dplyr and base R)
The final data manipulation verb is *join*, which has a wide array of functions associated with it. Joining is a powerful action which combines two separate data frames based on a key column (or columns) shared by both of them. For example, let's suppose we have a separate data frame with the population of cities.
```{r}
cities <- data.frame(
city = c("New York", "Baltimore", "Boston"),
population = sample(seq(1e6, 1e7, 1e3), 3))
```
In our original `sample_df` data frame, it would be nice to associate the population of each city with the churches located in those cities. Then we could calculate what percentage of the city attends each church or denomination. Since the city name is the same in the `city` columns in both data frames, we can join the two twogether. Churches in New York will get a new column `population` with the population of New York, and so forth. The join is performed by the function `left_join()` from [dplyr](http://cran.rstudio.org/web/packages/dplyr/).
```{r}
sample_df %>%
left_join(cities, by = "city")
```
Notice that in the new data frame, all the churches in New York have the same value for `population`, drawn from the `cities` data frame; the same is true for Baltimore and Boston.
There are many different kinds of joins, including `left_join()`, `inner_join()`, `semi_join()`, and `anti_join()`. You can read about the exact definition of each join by looking up `?join` in the dplyr help. Suffice it to say that the different kinds of joins do different things when matching data frames together, usually having to do with what happens when there is not a perfect match.
The `by = "city"` argument in `left_join()` specifies the names of the columns which have the shared key. There can be multiple names of the columns in `left_join()`, but the columns have to be named the same thing. If one data frame had a column named `city` and the other had a column named `city_name`, the join would not work. It is likely that eventually dplyr will remove this limitation. In the meantime, you can rename columns using `select()`. Specifying `select(city = city_name)` will rename a column; or you can add a new column with `mutate(city= city_name)`. You could also use the base R function `merge()` which lets you specify `by.x` and `by.y` arguments in which you give the names of the shared columns in each data frame.
### Do
Finally dplyr also provides a useful `do()` function for applying models and the like to data frames. We will use this function in the chapter on [statistics](statistics.html).
## Example: Tidying and analyzing birthplace data
Lets return to our NHGIS dataset about birthplace data to put these data manipulation verbs into practice. Here we have a very interesting dataset which can tell us about immigration to and migration in the United States. But it is contained in three different files, and the format of each file is almost unusable. In particular, the codes are unreadable. This dataset will give us practice manipulating and the querying data.
### Combining the data files
The first thing to do is to get the paths of our three files and the associated codebooks.
```{r}
birthplace1 <- "data/nhgis-nativity/nhgis0023_ds12_1850_state.csv"
birthplace2 <- "data/nhgis-nativity/nhgis0024_ds15_1860_state.csv"
birthplace3 <- "data/nhgis-nativity/nhgis0024_ds17_1870_state.csv"
codes1 <- "data/nhgis-nativity/nhgis0023_ds12_1850_state_codebook.txt"
codes2 <- "data/nhgis-nativity/nhgis0024_ds15_1860_state_codebook.txt"
codes3 <- "data/nhgis-nativity/nhgis0024_ds17_1870_state_codebook.txt"
```
Now we can load those files.
```{r}
birthdata1 <- read.csv(birthplace1, stringsAsFactors = FALSE)
birthdata2 <- read.csv(birthplace2, stringsAsFactors = FALSE)
birthdata3 <- read.csv(birthplace3, stringsAsFactors = FALSE)
```
A quick look at the first few lines of one of the files will remind us what we are working with.
```{r}
head(birthdata1)
```
We want to keep the columns `GISJOIN`, `YEAR`, `STATE`, `STATEA`, and `AREANAME` as columns, but we want to turn all the other columns into key-value pairs using `gather()`.^[Some of these columns will be useful later for merging our data with geospatial data, so we'll keep them around, even though they aren't strictly necessary for our purposes now.] For good measure, we will also give these data frames the extra class `tbl_df`.^[The main value that `tbl_df`, which is provided by [dplyr](http://cran.rstudio.org/web/packages/dplyr/), adds is that it makes the default printing of data frames much niced.]
```{r}
library(dplyr)
library(tidyr)
birthdata1 %>%
gather(code, value, -GISJOIN, -YEAR, -STATE, -STATEA, -AREANAME) %>%
tbl_df() -> birthdata1
birthdata2 %>%
gather(code, value, -GISJOIN, -YEAR, -STATE, -STATEA, -AREANAME) %>%
tbl_df() -> birthdata2
birthdata3 %>%
gather(code, value, -GISJOIN, -YEAR, -STATE, -STATEA, -AREANAME) %>%
tbl_df() -> birthdata3
```
Now we can check our work by looking at one of the our data variables.
```{r}
birthdata1
```
It would be nice to combine these three variables into one. We could not do this before when they had different numbers of columns with different names. But now that all three data frames have the same columns, we can easily combine them one after another with the function `rbind()`.
```{r}
birthplace <- rbind(birthdata1, birthdata2, birthdata3)
birthplace
```
Our new `birthplace` variable now has `r nrow(birthplace)` rows: the same as the sum of the number of rows in `birthdata1`, `birthdata2`, and `birthdata3`. We now have a single variable containing all our data in a long instead of wide format. This will be much easier to work with.
### Excursus: a better way of loading multiple files
You might have noticed that our method of opening multiple files is clunky. We had to keep track of a lot of variable names using numeric suffixes. Even worse, we had to do a lot of copying and pasting of code. This was barely manageable for three files; it would have been unmanageable for dozens of files. A good rule for programming is DRY: don't repeat yourself. If you find yourself copying and pasting code, there is probably a better way to do it.
This section uses some of R's functional programming techniques and its data structures to perform the same action more easily. This technique could load a hundred files as easily as it could load two. You can skip this section if you want.
Instead of loading each file individually, we are going to to use R's `lapply()` function to apply the `read.csv()` function to each element in a vector of file names. Those files will be read into a list of data frames. We can then use `lapply()` again to reshape the data frames. Then we can combine them all together.
```{r}
csv_files <- Sys.glob("data/nhgis-nativity/*state.csv")
```
Notice that in one lines of code we've loaded all of the filenames, using a Unix-like glob (or wildcard expansion) .
```{r}
csv_files
```
Now we use `lapply()` to apply the function that reads the CSV files.
```{r}
birthplace_data <- lapply(csv_files, read.csv, stringsAsFactors = FALSE)
```
The results is a list of data frames. Those data frames are identical to the data frames we load individually above.
```{r}
str(birthplace_data, max.level = 1)
```
Next we use `lapply()` again to reshape the data frames. Notice that the code is essentially the same was what we used above with `gather()` but we only have to write it once, no matter how many data frames are in our list. We've put the code inside an anonymous function (that is, a function that hasn't been assigned to a variable) because `lapply()` applies a function to some kind of vector or list.
```{r}
birthplace_reshaped <- lapply(birthplace_data, function(df) {
df %>%
gather(code, value, -GISJOIN, -YEAR, -STATE, -STATEA, -AREANAME) %>%
tbl_df()
})
```
We can double check that the code did what we expected.
```{r}
birthplace_reshaped
```
This is still a list of data frames, and now we need to bind them together. We can use the `rbind_all()` function provided by dplyr to combine them:
```{r warning=FALSE}
birthplace <- rbind_all(birthplace_reshaped)
birthplace
```
The result is exactly the same as what we did above, but it took many fewer lines of code, and the code is generalizable to any problem where we have to load multiple CSV files and combine them. We could have made the code even shorter by using the pipe operator. We will demonstrate that by loading and parsing the codebooks in a series of steps. Exactly why we are parsing the code books will be explained in the next section.
```{r warning=FALSE}
library(mullenMisc)
Sys.glob("data/nhgis-nativity/*state_codebook.txt") %>%
lapply(parse_nhgis_codebook) %>%
rbind_all() -> codebooks
codebooks
```
### Merging codes with their values
Our `birthplace` data frame is now much more usable than it was before. But still those codes are a problem. We have the codebooks that [NHGIS](http://nhgis.org) provided to explain them, but we can hardly go looking at them every time we want to see what a code means. The [mullenMisc](https://github.com/lmullen/mullenMisc) package contains a function, `parse_nhgis_codebook()` which can turn a codebook text file into a data frame.^[[mullenMisc](https://github.com/lmullen/mullenMisc) is my personal package which contains miscellaneous functions that I use repeatedly. You might eventually want to create your own personal package for such purposes.] We will do that for each of the files, then bind them together.
```{r}
library(mullenMisc)
codebook1 <- parse_nhgis_codebook(codes1)
codebook2 <- parse_nhgis_codebook(codes2)
codebook3 <- parse_nhgis_codebook(codes3)
codebooks <- rbind(codebook1, codebook2, codebook2)
```
Now we have a list of all the codes and their meanings. Now we can use a join to combine our dataframe of codes with our data frame of birthplaces. In this case the proper join is `left_join()`, since a left join will return all of the rows from our birthplace data (we want all of that, even if we don't have a matching code) and all of the columns from both the birthplace data frame and the codebook data frame. Both our `birthplace` data frame and our `codebooks` data frame have a column called `code`, so we don't need to specify which tables to join by.
```{r}
birthplace <- birthplace %>%
left_join(codebooks)
str(birthplace)
```
Now our data frame has a `meaning` column. This is much better than dealing with those codes. But still the meaning column has two kinds of information in it: whether the place is "native" or "foreign", and what the name of the place is. We can use `mutate()` to create new columns which extract just that information. The `stri_detect_fixed()` function will detect wether the word `"Foreign"` is present. If so, we will give that field `native_or_foreign` the value `"foreign"`; otherwise we will give it the value `"native"`.
```{r}
birthplace <- birthplace %>%
mutate(native_or_foreign = ifelse(stri_detect_fixed(meaning, "Foreign"),
"foreign", "native"))
```
Now we have a field with a native or foreign value, and we can use the `meaning` column for the actual place of birth. Now we can start asking interesting questions. In most cases these questions will involve summarizing the data. One interesting question is what the proportion of native- and foreign-born people was in each year.
```{r}
birthplace %>%
group_by(YEAR) %>%
mutate(total = sum(value, na.rm = TRUE),
proportion = value / total) %>%
group_by(YEAR, native_or_foreign) %>%
summarize(proportion = sum(proportion, na.rm = TRUE)) ->
birthplace_proportions_by_year
birthplace_proportions_by_year
library(ggplot2)
birthplace_proportions_by_year %>%
ggplot(aes(x = YEAR, y = proportion, fill= native_or_foreign)) +
geom_bar(stat = "identity") +
coord_flip()
```
Alternatively, we could try to find the most common foreign born group in each state by year.
```{r}
birthplace %>%
group_by(YEAR) %>%
mutate(total = sum(value, na.rm = TRUE),
proportion = value / total) %>%
filter(native_or_foreign == "foreign") %>%
ungroup() %>%
arrange(desc(proportion)) %>%
select(YEAR, STATE, meaning, value, proportion)
```
Unsurprisingly, Irish and German immigrants comprised the biggest proportion of immigrants in any state, though it is somewhat surprising that they were less than one percent of the population of any given state except New York in 1850.
## Broom package
R functions often return data in non-tidy formats. The spatial objects from the [sp](http://cran.rstudio.org/web/packages/sp/) package are one example, as are the models returned from various statistical functions. It is often much easier to use the [broom](http://cran.rstudio.org/web/packages/broom/) package to tidy this data than to work with it in its raw form. TODO
## Manipulating lists and lists of lists
## Data Cleaning
## Further reading
- Watch Hadley Wickham, "[Tidy Data and Tidy Tools](http://vimeo.com/33727555)," NYC Open Statistical Computing Meetup, Dec. 2011.
- Hadley Wickham, "Tidy Data," Journal of Statistical Software 50, no. 10 (2014): <http://www.jstatsoft.org/v59/i10>.