data.Rmd

---
title: Data Manipulation
---

One of the most crucial tasks in digital history is manipulating data. It is no exaggeration to say that in many projects, manipulating and cleaning the data is 80 percent or more of the work, while visualizing or analyzing it is only 20 percent of the work. It might seem like visualizing or analyzing is the real historical work, and it is certainly the part of the work that leads to historical interpretations. But data manipulation is the crucial task for the digital historian, because it is in manipulating the data that we explores its possibilities and its perils: that is, the possibilities that exist for finding interesting historical interpretations, but also the places where we could make errors by trusting the data too much.

There are several things that you can learn by learning data manipulation. Most obviously, you will learn how to get data from whatever form you receive it to a form which is useable for any kind of task, as well as how to take large datasets and summarize them meaningfully. Less obviously, the methods for manipulating data are closely related to some of the basic principles of databases. For example, filtering data is like querying it; merging data is like joining it; tidying data is like normalizing a database. If you can learn the principles behind data manipulation, you will be well prepared for understanding databases.

## Definitions

We need to begin with a few definitions. What is *data*? For our purposes, let us say that data is any source amenable to computation. Most obviously this could be numeric information, such as in the reports of the U.S. Census Bureau. But it could also be a corpus of text files to be text mined, or bibliographic information retrived from an API. 

Data comes in different *data structures*. Each of these types of data is likely to have its own format. The Census Bureau information is like to come as a table of numbers; the API information is likely to come as a list (an R data structure like an object in JavaScript, a dictionary in Python, or a hash in Ruby), and the text corpus is likely to be a set of plain text files. This chapter will briefly discuss how to work with lists, though we will leave most of those examples for the chapter on [APIs](apis.html). Working with text files will be discussed in the chapter on [text analysis](topic-modeling.html). This chapter will mostly discuss how to work with data that comes in tables: the form you are most likely to work with.

Tables in R are represented by a data structure called a *data frame*. A data frame is very much like a spreadsheet as represented in R. It contains columns, which have names which are usually the names of variables. It also has rows which contain the observations. A data frame is the type of data structure that you get whenever you load a .csv file into R.

Take for example the data frame of US state populations from the historydata package.

```{r}
library(stringi)
library(dplyr)
library(historydata)
data(us_state_populations)
us_state_populations
```

This data frame has columns of varying types. The `state` column is a character vector containing state names; the `population` column is a numeric vector.

This can be seen when examining the structure of the data frame.

```{r}
str(us_state_populations)
```

We can see that there are 983 "observations" (rows) and 4 "variables" (columns) in the object which has the class `data.frame`. The `$` gives us a hint that we can access the parts of the data frame with that operator. We can, for example, find the median population of all the states in U.S. history:

```{r}
median(us_state_populations$population, na.rm = TRUE)
```

We can think of data frames as coming in two forms. One is a *wide* data frame, in which there are many columns of variables. Take for instance this data set from the [NHGIS](http://nhgis.org) which contains information about the birthplace of citizens of each state in the United States in 1850.

```{r}
birthplace_1850 <- read.csv("data/nhgis-nativity/nhgis0023_ds12_1850_state.csv",
                            stringsAsFactors = FALSE)
str(birthplace_1850)
```

This dataset contains only `r nrow(birthplace_1850)` rows, one for each state or territory in the United States. But it contains `r ncol(birthplace_1850)` columns. Some of these contain the `YEAR` and `STATE` information. But most of these columns have cryptic names such as `AFA001` and `AFB057`. Looking at an excerpt from the codebook that NHGIS provides with the dataset, we can see what these columns mean.

```
Table 2:     Place of Birth
Universe:    Persons
Source code: NT12
NHGIS code:  AFB
    AFB001:      Native-born: Alabama
    AFB002:      Native-born: Arkansas
    AFB003:      Native-born: California
...
    AFB034:      Foreign-born: England
    AFB035:      Foreign-born: Scotland
    AFB036:      Foreign-born: Wales
    AFB037:      Foreign-born: Ireland

```

The name of each column in the dataset is itself a variable. The column name represents where someone was born. Actually, it represents two pieces of information: where someone was born, and whether that birth place was inside the United States or outside of it. This type of wide dataset is very difficult to work with for most purposes. For example, it would require a complex set of commands to find the proportion of native- and foreign-born people in each state. We have a similar table for other years, but it is impossible to easily join the two data frames together. It is also difficult to replace the codes with the names that they represent.

A much better solution is to use a *long* (or *narrow*) data format. Fortunately  R has a command to reshape the data frame from wide to long. Consider the same data frame in long format. Don't worry for now about the commands that do the transformation (but notice that we can do this transformation with a single function). Instead just look at the final product.

```{r}
library(tidyr)
birthplace_1850_long <- birthplace_1850 %>%
  gather(birthplace, count, -GISJOIN, -YEAR, -STATE, -STATEA, -AREANAME)

head(birthplace_1850_long)
str(birthplace_1850_long)
```

Notice that now we have `r nrow(birthplace_1850_long)` rows but only `r ncol(birthplace_1850_long)` columns. All of the columns whose names were codes are now gathered into the `birthplace` and `count` columns. The `birthplace` column contains the codes (which, as you remember, actually encode information about the birthplace) and the `count` column contains the number of people who had that particular birthplace for each combination of state and year. 

This long dataset is far more useful. We could much more easily replace the hard-to-read codes with the birthplaces that they represent. We could also do the same transformation on datasets for other years, and then append them to one another, creating a new timeseries. 

This way of structuring data has been called *tidy data* by Hadley Wickham.^[Hadley Wickham, "Tidy Data," Journal of Statistical Software 50, no. 10 (2014): <http://www.jstatsoft.org/v59/i10>.] We can use his definition of tidy data from the article in which he explains the concept:

> In tidy data:  
> 1. Each variable forms a column.  
> 2. Each observation forms a row.  
> 3. Each type of observational unit forms a table.^[Wickham, "Tidy Data," 4.]  

We will explore this definition by applying it to our dataset later in this chapter. In this case we have applied principles 1 and 2. For us an observation take the form, "*n* people from birthplace *x* lived in state *y* in year *z*." Our wide version of the birthplace data was messy because the columns contained information (the birthplace), not just the names of the variables. By reshaping the data, we have moved all of the information in the table into a place where it can be easily manipulated. The tidy data format is not superior to the wide data format for every purpose: for presentation purposes it is often much better to have data in a wide format, and sometimes it is easier to perform certain calculations in that format. The R package [tidyr](http://cran.rstudio.org/web/packages/tidyr/)  (again, written by Wickham) also provides a function to go from narrow to wide data. But for most data manipulation the goal is to achieve a tidy data format, from which it is far easier to do other data manipulations to gather information.

There is one other definition that we need to offer. So far we have been talking about *tidying* data, by which we mean changing its structure into a more useful form. Another task altogether is *cleaning* data. Data often contains inconsistencies or outright errors. For example, in our data frame of birthplaces, it is entirely possible that the names of states could be inconsistent: `"MA"`, `"Massachusetts"`, and `"Mass."` and even `"MA "` and `"Massachusetts "` (note the extra spaces) might convey the same meaning to you, but they are decidedly not the same thing to your computer. It is common for dates to contain errors: I recently cleaned some data where an event started in `1880` and ended in `1800`. Any human entered data will be rife with such errors; any data that is OCR'ed is likely to be worse. The methods of data manipulation, including data tidying, are generalizable to most datasets. The problems of data cleaning are particular to each dataset. Only by working closely with a dataset will you be able to identify the problems and find an appropriate way to solve them. This chapter will suggest some basic strategies for data cleaning, but on the whole it will focus on data tidying and manipulation.

This chapter will begin by identifying the common verbs of a grammar of data manipulation provided by two packages, [dplyr](http://cran.rstudio.org/web/packages/dplyr/) and [tidyr](http://cran.rstudio.org/web/packages/tidyr/), using a small sample dataset. Then it will take your through the process of combining those verbs to turn the NHGIS dataset on birthplaces into a usable dataset.

## The grammar of data manipulation

When we manipulate data, it helps to think of what we are doing as a set of verbs. Since almost everything in R is a function, those verbs will be expressed as functions. This concept of data manipulation as a grammar, as well as the actual R functions which make it possible, are drawn from Hadley Wickham's work.^[The package dplyr is also written by Romain Francois.]

There are essentially eight verbs for data manipulation. These eight verbs are provided by two R packages, [dplyr](http://cran.rstudio.org/web/packages/dplyr/)   and [tidyr](http://cran.rstudio.org/web/packages/tidyr/). While these manipulations can also be performed in base R, they are far more easily explained, and even faster to compute, using these two packages.

We'll begin our exploration of the grammar of data manipulation by creating a toy dataset small enough that we can perform all of the calculations manually to understand what is going on. We will use a made up example of church memberships.

```{r}
set.seed(337)
sample_df <- data.frame(
  name = toupper(letters[1:10]),
  denomination = sample(c("Presbyterian", "Episcopalian",
                          "Catholic", "Baptist"), 10, replace = TRUE),
  city = sample(c("New York", "Boston", "Baltimore"), 10,  replace = TRUE),
  members = sample(seq(1e2, 1e3, 10), 10, replace = TRUE),
  attendees = sample(seq(1e2, 1e4, 10), 10, replace = TRUE),
  stringsAsFactors = FALSE)
sample_df
```

Now that we have a data frame with sample data, we can use it to try out the data manipulation verbs. We also will load our two necessary libraries:

```{r}
library(dplyr)
library(tidyr)
```


### The pipe (%>%) function

First we have to learn about the pipe operator, `%>%`. The pipe operator is not one of the data manipulation verbs, but it will make our task of data manipulation far easier.^[The pipe operator is provided by the [magrittr](http://cran.rstudio.org/web/packages/magrittr/) by Stefan Bache. It is not necessary to load magrittr directly if you are loading dplyr, since dply imports the pipe operator. The pipe operator is, like everything in R a function and not technically an operator. But it is a special type of binary function in R which takes its arguments as a left-hand side and a right-hand side.]

The pipe operator is like the pipe operator, `|`, in Unix: it passes the results of one command to the input of the next. You can think of it as the word "then" for R: do this *then* do that. What the pipe operator does is take the output of one function and pass it as the first argument in the next function. Consider the following example.

```{r}
head(sample_df)

sample_df %>% head()
```

Notice that the two examples are equivalent. The pipe operator lets us call a function on a dataset by writing it after instead of around the variable name. Even more important, it lets us pass the dataset through a chain of operators. We can do this without saving any intermerdiate state. Consider the following more complicated example to find out how many unique denominations are in our dataset:

```{r}
length(unique(sample_df$denomination))

sample_df$denomination %>%
  unique() %>%
  length()
```

The top example is nearly unreadable, and that is with only two function calls and a variable name. The bottom example is far more readable: "Take the list of denominations, *then* find the unique names, *then* find the number of unique names." The ability to chain operators lets us string together our data manipulation verbs to perform complex actions.^[For more, see the [magrittr vignette](http://cran.r-project.org/web/packages/magrittr/vignettes/magrittr.html)].

### select() (dplyr)

Our first data manipulation verb is *select*. The `select()` operator lets us pick which columns we want from a data frame. We can do this by specifying the names of the columns that we wish to keep:

```{r}
sample_df %>%
  select(name, city, members)
```

It is also possible to specify the columns we do not want and keep all the others.

```{r}
sample_df %>%
  select(-denomination, -attendees)
```

You can also specify a range of columns using the `:` symbol.

```{r}
sample_df %>%
  select(name, city:attendees)
```

### filter() (dplyr)

The verb to select only certain rows from a column is *filter*. The `filter()` function take a conditional expression using one or more of the columns in the data frame. A conditional checks for the truth of an expression and returns a boolean value, `TRUE` or `FALSE`. All of the following are examples of conditionals.

```{r}
100 > 0
4 <= 0
"string" == "another string"
is.na("a value")
is.na(NA)
nchar("string") == 6
```

You can use the `filter()` function to return only the rows that match a particular conditional or set of conditionals. For example, we could return only the Catholic churches in our sample data frame.

```{r}
sample_df %>%
  filter(denomination == "Catholic")
```

Alternatively, we could find only the churches with at least 500 members.

```{r}
sample_df %>%
  filter(members >= 500)
```

Or we could use both conditions. This is the equivalent of joining the two conditions with the `&` operator.

```{r}
sample_df %>%
  filter(denomination == "Catholic",
         members >= 500)
```

### arrange() (dplyr)

The verb *arrange* and its corresponding `arrange() function lets us sort a data frame. We might want to sort our sample list of churches by the number of attendees:

```{r}
sample_df %>%
  arrange(attendees)
```

Notice that the churches are sorted in ascending order of attendees. If we want to sort them from greatest to least, we can use the `desc()` function.

```{r}
sample_df %>%
  arrange(desc(attendees))
```

Sorting will work with character vectors as well as with numeric vectors.

### mutate() (dplyr)

*Mutate* and the `mutate()` function is the verb that lets us create new columns from existing columns.^[You can think of this, if you must, as being like a formula in Excel.] For example, in our churches dataset, we might like to know the proportion of attendees that are members. We can use the `=` operator to assign a calculation to a new column named whatever we like.

```{r}
sample_df %>%
  mutate(membership_proportion = members / attendees)
```

You can use use the `mutate()` function for all kinds of operations, not just numeric calculations. For example, we could create a column, `protestant`, which has the value `TRUE` if the denomination field is not `Catholic`, and `FALSE` otherwise.

```{r}
sample_df %>%
  mutate(protestant = ifelse(denomination == "Catholic", FALSE, TRUE))
```

### summarize() and group_by() (dplyr)

One of the most powerful data manipulation verbs is *summarize*, `summarize(). It is often the case that our dataset contains many individual observations which we want to collapse into a summary table. In this pattern, we take our data frame and split it apart into several data frames by grouping together unique variables or combinations of variables. We then summarize each data frame using some function, and combine them back together.

This concept is difficult to wrap our minds around without an example. Say, for example, that we wanted to know the total number of each denomination from our data frame. We have four different denominations represented. We could group all the Episcopalians together, all the Baptists together, and so on. Then we could count how many are in each group (using the function `n()`) and combine that information into a new data frame. The three pieces here are the variable (or combination of variables) to group by, the function used to summarize them, and the name of the new column to represent the summed up value. 

```{r}
sample_df %>%
  group_by(denomination) %>%
  summarize(number_of_churches = n())
```

We could perform the same essential operation by count the number of churches in each city.

```{r}
sample_df %>%
  group_by(city) %>%
  summarize(number_of_churches = n())
```

Or we could summarize the churches in a more sophisticated way by summing up (using the function `sum()`) the number of attendees and members for each denomination, or by using `mean()` to find the average number of attendees and members.^[Keep in mind that the `sum()` function will return `NA` if the vector passed to it contains any `NA`s. This is because R is (rightly) conservative and won't pretend to assign a value to those missing values. You can avoid this by adding the `na.rm = TRUE` argument to sum. So if you have a vector `x` with `NA` values, then `sum(x, na.rm = TRUE)` will remove the missing values and sum the rest. The same argument also applies to `mean()` and `median()` and quite a few other R functions.]

```{r}
sample_df %>%
  group_by(denomination) %>%
  summarize(total_members = sum(members),
            total_attendees = sum(attendees),
            avg_members = mean(members),
            avg_attendees = mean(attendees))
```

We could also find the total number of members for each denomination by city by adding a field to the `group_by()` function. This will yield all the unique combinations of `city` and `denomination`.

```{r}
sample_df %>%
  group_by(denomination, city) %>%
  summarize(total_members = sum(members))
```

When using summarize it is important to keep the three components in mind.

1. The `group_by()` function takes the name of a variable or variables for which it will find all the unique combinations. The number of unique combinations will be the number of rows in our summarized data frame. Those unique combinations will be the basis of several new data frames which will be passed to step 2.

2. The function to the right of the `=` sign inside `summarize()` provides a function which turns multiple rows in our split-apart data frames into a single row. For example, `sum()` and `mean()` both take a vector of numbers and return a single value, while `n()` counts the number of rows in the split-apart data frame. 

3. The variable name to the left of the `=` sign inside `summarize()` provides the new column name in the summarized data frame.^[These three steps correspond to the "split-apply-combine" steps in Hadley Wickham's article by that name. Hadley Wickham, "The Split-Apply-Combine Strategy for Data Analysis," *Journal of Statistical Software* 40, no. 1 (2011): <http://www.jstatsoft.org/v40/i01/>.]

### gather() (tidyr)

Our next two data manipulation verbs are drawn from the [tidyr](http://cran.rstudio.org/web/packages/tidyr/) package instead of from the [dplyr](http://cran.rstudio.org/web/packages/dply/) package. Because they involve manipulating data from wide to long formats and back again, we will create a slightly different example data frame. Suppose that our data frame from earlier now contains two columns for the membership in three different years:

```{r}
sample_df_wide <- data.frame(
  name = toupper(letters[1:10]),
  denomination = sample(c("Presbyterian", "Episcopalian",
                          "Catholic", "Baptist"), 10, replace = TRUE),
  city = sample(c("New York", "Boston", "Baltimore"), 10,  replace = TRUE),
  members_1830 = sample(seq(1e2, 1e3, 10), 10, replace = TRUE),
  members_1840 = sample(seq(1e2, 1e3, 10), 10, replace = TRUE),
  members_1850 = sample(seq(1e2, 1e3, 10), 10, replace = TRUE),
  stringsAsFactors = FALSE)
sample_df_wide
```

This information would be easier to plot if it were in long format instead of wide format. The `gather()` function from tidyr lets us perform such an operation. 

What we want to do in this case is take the three columns `members_1830`, `members_1840`, and `members_1850` and turn them into two columns: `year` and `members`. As they stand now, the column names actually represent data, that is, the year in which an observation was taken, and that information belongs in its own column. The `gather()` function will expect us to give the names of the key and value columns. In this case, that will be `year` (whose values will beome the the existing names of the columns) and `members` (whose values will become the current values of the columns we are gathering). The function `gather()` will also let us specify which columns should remain untouched, in this case, `name`, `denomination`, and `city`. The columns to be spread can be specified with the same syntax as in `select()`.

```{r}
sample_df_wide %>%
  gather(year, members, -name, -denomination, -city)
```

Notice that our data frame is now much longer but also narrower. We would have to do some additional cleanup to transform the character vectors of the type `members_1830` to integers of the type `1830`, but we could do that using mutate and functions from the [stringr](http://cran.rstudio.org/web/packages/stringr/) package. (See the chapter on manipulating [strings](strings.html).)

### spread() (tidyr)

*Gather* is the exact opposite of *spread*. Where the `gather()` function took column names and turned them into the values of rows, `spread()` takes the values of rows and turns them into columns. We can see this with our newly transformed long dataset made with `gather()`, which we can retransform using `spread()`. We have to pass spread the names of the key and value columns. Each unique value in the key column will become a new column with that name; each associated value will the value column will be filed under that column.

```{r}
sample_df_long <- sample_df_wide %>%
  gather(year, members, -name, -denomination, -city)

sample_df_long %>%
  spread(year, members)
```

Sometimes you will want to spread or gather a column to perform calculations on it, and then you will want to do the reverse to get it back to its original form.

### The join family of functions (dplyr and base R)

The final data manipulation verb is *join*, which has a wide array of functions associated with it. Joining is a powerful action which combines two separate data frames based on a key column (or columns) shared by both of them. For example, let's suppose we have a separate data frame with the population of cities.

```{r}
cities <- data.frame(
  city = c("New York", "Baltimore", "Boston"),
  population = sample(seq(1e6, 1e7, 1e3), 3))
```

In our original `sample_df` data frame, it would be nice to associate the population of each city with the churches located in those cities. Then we could calculate what percentage of the city attends each church or denomination. Since the city name is the same in the `city` columns in both data frames, we can join the two twogether. Churches in New York will get a new column `population` with the population of New York, and so forth. The join is performed by the function `left_join()` from [dplyr](http://cran.rstudio.org/web/packages/dplyr/).

```{r}
sample_df %>%
  left_join(cities, by = "city")
```

Notice that in the new data frame, all the churches in New York have the same value for `population`, drawn from the `cities` data frame; the same is true for Baltimore and Boston.

There are many different kinds of joins, including `left_join()`, `inner_join()`, `semi_join()`, and `anti_join()`. You can read about the exact definition of each join by looking up `?join` in the dplyr help. Suffice it to say that the different kinds of joins do different things when matching data frames together, usually having to do with what happens when there is not a perfect match.

The `by = "city"` argument in `left_join()` specifies the names of the columns which have the shared key. There can be multiple names of the columns in `left_join()`, but the columns have to be named the same thing. If one data frame had a column named `city` and the other had a column named `city_name`, the join would not work. It is likely that eventually dplyr will remove this limitation. In the meantime, you can rename columns using `select()`. Specifying `select(city = city_name)` will rename a column; or you can add a new column with `mutate(city= city_name)`. You could also use the base R function `merge()` which lets you specify `by.x` and `by.y` arguments in which you give the names of the shared columns in each data frame.

### Do

Finally dplyr also provides a useful `do()` function for applying models and the like to data frames. We will use this function in the chapter on [statistics](statistics.html).

## Example: Tidying and analyzing birthplace data

Lets return to our NHGIS dataset about birthplace data to put these data manipulation verbs into practice. Here we have a very interesting dataset which can tell us about immigration to and migration in the United States. But it is contained in three different files, and the format of each file is almost unusable. In particular, the codes are unreadable. This dataset will give us practice manipulating and the querying data.

### Combining the data files

The first thing to do is to get the paths of our three files and the associated codebooks.

```{r}
birthplace1 <- "data/nhgis-nativity/nhgis0023_ds12_1850_state.csv"
birthplace2 <- "data/nhgis-nativity/nhgis0024_ds15_1860_state.csv"
birthplace3 <- "data/nhgis-nativity/nhgis0024_ds17_1870_state.csv"

codes1 <- "data/nhgis-nativity/nhgis0023_ds12_1850_state_codebook.txt"
codes2 <- "data/nhgis-nativity/nhgis0024_ds15_1860_state_codebook.txt"
codes3 <- "data/nhgis-nativity/nhgis0024_ds17_1870_state_codebook.txt"
```

Now we can load those files. 

```{r}
birthdata1 <- read.csv(birthplace1, stringsAsFactors = FALSE)
birthdata2 <- read.csv(birthplace2, stringsAsFactors = FALSE)
birthdata3 <- read.csv(birthplace3, stringsAsFactors = FALSE)
```

A quick look at the first few lines of one of the files will remind us what we are working with.

```{r}
head(birthdata1)
```

We want to keep the columns `GISJOIN`, `YEAR`, `STATE`, `STATEA`, and `AREANAME` as columns, but we want to turn all the other columns into key-value pairs using `gather()`.^[Some of these columns will be useful later for merging our data with geospatial data, so we'll keep them around, even though they aren't strictly necessary for our purposes now.] For good measure, we will also give these data frames the extra class `tbl_df`.^[The main value that `tbl_df`, which is provided by [dplyr](http://cran.rstudio.org/web/packages/dplyr/), adds is that it makes the default printing of data frames much niced.]

```{r}
library(dplyr)
library(tidyr)

birthdata1 %>%
  gather(code, value, -GISJOIN, -YEAR, -STATE, -STATEA, -AREANAME) %>%
  tbl_df() -> birthdata1

birthdata2 %>%
  gather(code, value, -GISJOIN, -YEAR, -STATE, -STATEA, -AREANAME) %>%
  tbl_df() -> birthdata2

birthdata3 %>%
  gather(code, value, -GISJOIN, -YEAR, -STATE, -STATEA, -AREANAME) %>%
  tbl_df() -> birthdata3
```

Now we can check our work by looking at one of the our data variables.

```{r}
birthdata1
```

It would be nice to combine these three variables into one. We could not do this before when they had different numbers of columns with different names. But now that all three data frames have the same columns, we can easily combine them one after another with the function `rbind()`.

```{r}
birthplace <- rbind(birthdata1, birthdata2, birthdata3)
birthplace
```

Our new `birthplace` variable now has `r nrow(birthplace)` rows: the same as the sum of the number of rows in `birthdata1`, `birthdata2`, and `birthdata3`. We now have a single variable containing all our data in a long instead of wide format. This will be much easier to work with.

### Excursus: a better way of loading multiple files

You might have noticed that our method of opening multiple files is clunky. We had to keep track of a lot of variable names using numeric suffixes. Even worse, we had to do a lot of copying and pasting of code. This was barely manageable for three files; it would have been unmanageable for dozens of files. A good rule for programming is DRY: don't repeat yourself. If you find yourself copying and pasting code, there is probably a better way to do it.

This section uses some of R's functional programming techniques and its data structures to perform the same action more easily. This technique could load a hundred files as easily as it could load two. You can skip this section if you want.

Instead of loading each file individually, we are going to to use R's `lapply()` function to apply the `read.csv()` function to each element in a vector of file names. Those files will be read into a list of data frames. We can then use `lapply()` again to reshape the data frames. Then we can combine them all together.

```{r}
csv_files <- Sys.glob("data/nhgis-nativity/*state.csv")
```

Notice that in one lines of code we've loaded all of the filenames, using a Unix-like glob (or wildcard expansion) .

```{r}
csv_files
```

Now we use `lapply()` to apply the function that reads the CSV files.

```{r}
birthplace_data <- lapply(csv_files, read.csv, stringsAsFactors = FALSE)
```

The results is a list of data frames. Those data frames are identical to the data frames we load individually above.

```{r}
str(birthplace_data, max.level = 1)
```

Next we use `lapply()` again to reshape the data frames. Notice that the code is essentially the same was what we used above with `gather()` but we only have to write it once, no matter how many data frames are in our list. We've put the code inside an anonymous function (that is, a function that hasn't been assigned to a variable) because `lapply()` applies a function to some kind of vector or list.

```{r}
birthplace_reshaped <- lapply(birthplace_data, function(df) {
  df %>%
    gather(code, value, -GISJOIN, -YEAR, -STATE, -STATEA, -AREANAME) %>%
    tbl_df()
  })
```

We can double check that the code did what we expected.

```{r}
birthplace_reshaped
```

This is still a list of data frames, and now we need to bind them together. We can use the `rbind_all()` function provided by dplyr to combine them:

```{r warning=FALSE}
birthplace <- rbind_all(birthplace_reshaped)
birthplace
```

The result is exactly the same as what we did above, but it took many fewer lines of code, and the code is generalizable to any problem where we have to load multiple CSV files and combine them. We could have made the code even shorter by using the pipe operator. We will demonstrate that by loading and parsing the codebooks in a series of steps. Exactly why we are parsing the code books will be explained in the next section.

```{r warning=FALSE}
library(mullenMisc)
Sys.glob("data/nhgis-nativity/*state_codebook.txt") %>%
  lapply(parse_nhgis_codebook) %>%
  rbind_all() -> codebooks

codebooks
```

### Merging codes with their values

Our `birthplace` data frame is now much more usable than it was before. But still those codes are a problem. We have the codebooks that [NHGIS](http://nhgis.org) provided to explain them, but we can hardly go looking at them every time we want to see what a code means. The [mullenMisc](https://github.com/lmullen/mullenMisc) package contains a function, `parse_nhgis_codebook()` which can turn a codebook text file into a data frame.^[[mullenMisc](https://github.com/lmullen/mullenMisc) is my personal package which contains miscellaneous functions that I use repeatedly. You might eventually want to create your own personal package for such purposes.] We will do that for each of the files, then bind them together.

```{r}
library(mullenMisc)
codebook1 <- parse_nhgis_codebook(codes1)
codebook2 <- parse_nhgis_codebook(codes2)
codebook3 <- parse_nhgis_codebook(codes3)

codebooks <-  rbind(codebook1, codebook2, codebook2)
```

Now we have a list of all the codes and their meanings. Now we can use a join to combine our dataframe of codes with our data frame of birthplaces. In this case the proper join is `left_join()`, since a left join will return all of the rows from our birthplace data (we want all of that, even if we don't have a matching code) and all of the columns from both the birthplace data frame and the codebook data frame. Both our `birthplace` data frame and our `codebooks` data frame have a column called `code`, so we don't need to specify which tables to join by.

```{r}
birthplace <- birthplace %>%
  left_join(codebooks)

str(birthplace)
```

Now our data frame has a `meaning` column. This is much better than dealing with those codes. But still the meaning column has two kinds of information in it: whether the place is "native" or "foreign", and what the name of the place is. We can use `mutate()` to create new columns which extract just that information. The `stri_detect_fixed()` function will detect wether the word `"Foreign"` is present. If so, we will give that field `native_or_foreign` the value `"foreign"`; otherwise we will give it the value `"native"`.

```{r}
birthplace <- birthplace %>%
  mutate(native_or_foreign = ifelse(stri_detect_fixed(meaning, "Foreign"),
                                    "foreign", "native")) 
```

Now we have a field with a native or foreign value, and we can use the `meaning` column for the actual place of birth. Now we can start asking interesting questions. In most cases these questions will involve summarizing the data. One interesting question is what the proportion of native- and foreign-born people was in each year.

```{r}
birthplace %>%
  group_by(YEAR) %>%
  mutate(total = sum(value, na.rm = TRUE),
         proportion = value / total) %>%
  group_by(YEAR, native_or_foreign) %>%
  summarize(proportion = sum(proportion, na.rm = TRUE)) ->
  birthplace_proportions_by_year

birthplace_proportions_by_year
  
library(ggplot2)
birthplace_proportions_by_year %>%
  ggplot(aes(x = YEAR, y = proportion, fill= native_or_foreign)) + 
  geom_bar(stat = "identity") +
  coord_flip()
```

Alternatively, we could try to find the most common foreign born group in each state by year.

```{r}
birthplace %>%
  group_by(YEAR) %>%
  mutate(total = sum(value, na.rm = TRUE),
         proportion = value / total) %>%
  filter(native_or_foreign == "foreign") %>%
  ungroup() %>%
  arrange(desc(proportion)) %>%
  select(YEAR, STATE, meaning, value, proportion)
```

Unsurprisingly, Irish and German immigrants comprised the biggest proportion of immigrants in any state, though it is somewhat surprising that they were less than one percent of the population of any given state except New York in 1850.

## Broom package

R functions often return data in non-tidy formats. The spatial objects from the [sp](http://cran.rstudio.org/web/packages/sp/) package are one example, as are the models returned from various statistical functions. It is often much easier to use the [broom](http://cran.rstudio.org/web/packages/broom/) package to tidy this data than to work with it in its raw form. TODO

## Manipulating lists and lists of lists

## Data Cleaning

## Further reading

- Watch Hadley Wickham, "[Tidy Data and Tidy Tools](http://vimeo.com/33727555)," NYC Open Statistical Computing Meetup, Dec. 2011.
- Hadley Wickham, "Tidy Data," Journal of Statistical Software 50, no. 10 (2014): <http://www.jstatsoft.org/v59/i10>.