<!--
This file by Martin Monkman
is licensed under a Creative Commons Attribution 4.0 International License
https://creativecommons.org/licenses/by/4.0/
-->
# (PART) Week 6 {-}
# Iteration {#iteration}
## Setup
This chunk of R code loads the packages that we will be using.
```{r setup_600, eval=FALSE}
library(tidyverse)
library(gapminder)
```
### Reading & Resources {#iteration-reading}
Hadley Wickham, Mine Çetinkaya-Rundel, and Garrett Grolemund, [_R for Data Science_ (2nd ed.), "27 Iteration"](https://r4ds.hadley.nz/iteration)
JD Long and Paul Teetor, [R Cookbook, 2nd ed., "Iterating with a Loop"](https://rc2e.com/simpleprogramming#recipe-for_loop), 2019
## Iteration {#iteration-intro}
Like functions, iteration is a way to reduce
* repetition in your code, and
* repetition in copy-and-paste tasks.
Iteration:
> helps you when you need to do the same thing to multiple inputs: repeating the same operation on different columns, or on different datasets. ([_R4DS_, "21 Iteration"](https://r4ds.had.co.nz/iteration.html))
A rule to follow:
**Never copy and paste more than twice.**
As a simple example, we create a data frame (tibble) named `df` with four columns of 50 records each; every column is drawn from a normal distribution centred on 100, with a standard deviation of 10. (This is a variation on the example in the "Iteration" chapter of _R4DS_.)
```{r iteration_df}
set.seed(8675309)
df <- tibble(
a = rnorm(50, mean = 100, sd = 10),
b = rnorm(50, mean = 100, sd = 10),
c = rnorm(50, mean = 100, sd = 10),
d = rnorm(50, mean = 100, sd = 10)
)
df
```
Calculate the median of variable `a`:
```{r}
median(df$a)
```
Calculate the median of the other three variables:
```{r}
# solution
median(df$b)
median(df$c)
median(df$d)
```
## For-Loop {#iteration-loop}
Rather than repeat the same function call four times, you can write a "for-loop", which executes the code within the loop once for each element of a sequence.
There are three parts in a for-loop:
1. The output: define the output object before you start the loop, which makes your loop more efficient. Notice that the code below uses `ncol()` to count the number of columns in our data frame `df`.
2. The sequence, which defines the number of times the loop iterates. Here, `seq_along(df)` generates one index for each column.
3. The body: the code that runs on each iteration.
```{r}
output <- vector(mode = "double", length = ncol(df)) # 1. output
# the loop
# - "i" is the counter that takes each column position in turn
for (i in seq_along(df)) { # 2. sequence
output[[i]] <- median(df[[i]]) # 3. body
}
# print the output object
output
```
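The loop above uses `seq_along()` rather than `1:ncol(df)` for the sequence. A minimal sketch of why that is the safer habit, assuming a hypothetical data frame with no columns (`empty_df` and `visits` are names made up for this illustration):

```{r}
# a hypothetical data frame with no columns
empty_df <- data.frame()

# 1:ncol() counts down from 1 to 0, so a loop over it would run twice
1:ncol(empty_df)

# seq_along() returns a zero-length sequence instead
seq_along(empty_df)

# the loop body never runs, so the counter stays at zero
visits <- 0
for (i in seq_along(empty_df)) {
  visits <- visits + 1
}
visits
```

With `1:ncol(empty_df)` the body would try to read columns 1 and 0, neither of which exists.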
## Detour: square brackets! {#iteration-squarebrackets}
These are _accessors_: a way of getting at data by giving its location in the object.
Double square brackets (`[[ ]]`) are the base R way to extract a single element from a list. A data frame is a list of columns, so the same accessors work on `df`.
One square bracket: the first column, still wrapped in a data frame (tibble)
* think of this as a subset of the object
```{r}
df[1]
```
Two square brackets: the contents of the first column, returned as a vector
```{r}
df[[1]]
```
Two values inside two square brackets: a single value, located by row and column
* `[[row, column]]`
```{r}
df[[50, 1]]
```
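The same accessors work on any list, not only on data frames. The sketch below uses a small made-up list (`fruit` is a hypothetical object, not part of the course data) to show what each form returns:

```{r}
# a small hypothetical list with two named elements
fruit <- list(name = c("apple", "pear"), weight = c(150, 180))

one <- fruit[1]    # single bracket: still a list, now of length 1
two <- fruit[[1]]  # double bracket: the character vector itself

class(one)
class(two)

# $ is shorthand for [[ ]] with a name
identical(fruit$weight, fruit[["weight"]])

# chain the accessors to reach one value: the second element of "weight"
fruit[["weight"]][[2]]
```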
For a good explanation of R's accessors and the application in selecting vector and matrix elements, see
* ["\[, \[\[, $: R accessors explained"](https://www.r-bloggers.com/r-accessors-explained/) by Christopher Brown
* [_R Cookbook_, 2nd ed., "2.9 Selecting Vector Elements"](https://rc2e.com/somebasics#recipe-id039)
## Data frame iteration {#iteration-dataframe}
Let's calculate the mean of every column in the `mtcars` data frame:
```{r}
mtcars
```
**Advice from _R4DS_:**
> Think about the output, sequence, and body **before** you start writing the loop.
* Note that we can use the `names()` function to pull the variable names (column names) from the data frame.
```{r}
# create the output vector
output <- vector("double", ncol(mtcars))
# the names() function returns the column names
names(mtcars)
# assign the names to the `output` vector
names(output) <- names(mtcars)
# the loop
for (i in names(mtcars)) {
output[i] <- mean(mtcars[[i]])
}
# print the output
output
```
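One way to check a loop like this is to compare its result with a built-in function that does the same job. The sketch below rebuilds the means under a different name (`col_means`, chosen here so the chunk stands alone) and compares them with base R's `colMeans()`:

```{r}
# rebuild the column means with the same loop
col_means <- vector("double", ncol(mtcars))
names(col_means) <- names(mtcars)
for (i in names(mtcars)) {
  col_means[i] <- mean(mtcars[[i]])
}

# a named vector can be indexed by name as well as by position
col_means["mpg"]
col_means[[1]]

# all.equal() allows for tiny floating-point differences
all.equal(col_means, colMeans(mtcars))
```

Because the output vector carries the column names, each result stays labelled with the variable it came from.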
## A very practical case {#iteration-practical}
(This is modified from ["21.3.5 Exercises"](https://r4ds.had.co.nz/iteration.html#exercises-56) in _R for Data Science_, which uses CSV files.)
Imagine you have a directory full of Excel files that you want to read and then combine into a single dataframe. This circumstance is encountered in many data analysis situations—perhaps the finance department of your company produces a monthly report with the data from a single month, but your interest is to compare many months of data over a few years. The challenge is to merge the data into a single R object containing all of the available data.
For this example, we have three files that are identically structured. Because we now know how to write a loop, we can write some code that will read each file, and then merge them together.
The steps are:
1. Create a list with the names of the files of interest (that is, the Excel files) in the sub-folder "data_monthly". The function `dir()` returns a character vector of the names of the files (or directories) in the named directory.
In the code below, note the regular expression `"\\.xlsx?$"` that defines the pattern for an Excel file name: `\\.` matches a literal dot, `x?` makes the final "x" optional, and `$` anchors the match to the end of the name. The pattern therefore matches both the older ".xls" extension and the newer ".xlsx".
```{r}
all_files <- dir("data_monthly/", pattern = "\\.xlsx?$", full.names = TRUE)
all_files
```
2. With the `length()` function, we can find out how many files there are, and use that to pre-allocate our output list:
```{r}
df_list <- vector("list", length(all_files))
```
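Before running the loop, it can be worth checking the file-name pattern from step 1 against a few names. In a regular expression, `*` is not a shell-style wild card; it means "zero or more of the preceding character". The sketch below compares a loose pattern with an end-anchored one, using made-up file names (the `candidates` vector is purely illustrative):

```{r}
# hypothetical file names, for illustration only
candidates <- c("jan.xls", "feb.xlsx", "notes.xl", "mar.xlsx.bak", "summary.csv")

# a dot, "x", "l", then zero or more "s", anywhere in the name
loose <- grepl("\\.xls*", candidates)

# ".xls" with an optional trailing "x", at the end of the name
anchored <- grepl("\\.xlsx?$", candidates)

data.frame(candidates, loose, anchored)
```

With these names, the loose pattern also accepts `notes.xl` and `mar.xlsx.bak`, while the anchored pattern accepts only the two Excel files.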
The `all_files` and `df_list` objects are used in our loop:
```{r}
for (i in seq_along(all_files)) {
df_list[[i]] <- readxl::read_excel(all_files[[i]])
}
```
This creates an interesting R object—it's a list, but a list of tibbles, each one with the contents of a single Excel file. This structure is sometimes referred to as "nested"—the tibbles are _nested_ inside the list.
Let's take a look:
```{r}
df_list
```
The last step is to use the function `bind_rows()` (from {dplyr}) to combine the list of data frames into a single data frame.
```{r}
df <- dplyr::bind_rows(df_list)
df
```
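The same list, pre-allocate, loop, and bind pattern can be tried without any Excel files on hand. The sketch below is an assumption-laden stand-in: it writes three small CSV files to a temporary folder (the folder name, file names, and values are all invented) and reads them with base R's `read.csv()` in place of `readxl::read_excel()`.

```{r}
# set up a throwaway folder with three identically structured CSV files
# (folder name, file names, and values are made up for this sketch)
tmp_dir <- file.path(tempdir(), "monthly_sketch")
dir.create(tmp_dir, showWarnings = FALSE)
for (m in c("2024-01", "2024-02", "2024-03")) {
  write.csv(
    data.frame(month = m, sales = c(10, 20)),
    file.path(tmp_dir, paste0(m, ".csv")),
    row.names = FALSE
  )
}

# 1. list the files; 2. pre-allocate the output; 3. loop over the files
csv_files <- dir(tmp_dir, pattern = "\\.csv$", full.names = TRUE)
csv_list <- vector("list", length(csv_files))
for (i in seq_along(csv_files)) {
  csv_list[[i]] <- read.csv(csv_files[[i]])
}

# combine the list of data frames into a single data frame
combined <- dplyr::bind_rows(csv_list)
combined
```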
### An alternative
In this approach, we name the elements of the output list with the file names, and loop over those names rather than over index positions:
```{r}
# pre-allocate the list, as before
df2_list <- vector("list", length(all_files))
# assign the file names
names(df2_list) <- all_files
# the loop
for (fname in all_files) {
df2_list[[fname]] <- readxl::read_excel(fname)
}
df2 <- dplyr::bind_rows(df2_list)
df2
```
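One benefit of a named list is that `dplyr::bind_rows()` can record each row's origin through its `.id` argument. A minimal sketch, using two small invented data frames in place of the Excel files (`monthly` and the column name `source_file` are hypothetical):

```{r}
# two small hypothetical data frames in a named list
monthly <- list(
  "jan.xlsx" = data.frame(sales = c(10, 20)),
  "feb.xlsx" = data.frame(sales = c(30, 40))
)

# .id adds a column holding the name of the list element each row came from
with_source <- dplyr::bind_rows(monthly, .id = "source_file")
with_source
```

This keeps track of which file each row came from after the data frames are combined.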
-30-