600_iteration.Rmd

<!-- 
This file by Martin Monkman 
is licensed under a Creative Commons Attribution 4.0 International License
https://creativecommons.org/licenses/by/4.0/  
-->

# (PART) Week 6 {-} 


# Iteration {#iteration}

## Setup

This chunk of R code loads the packages that we will be using.

```{r setup_600, eval=FALSE}
library(tidyverse)

library(tidyverse)
library(gapminder)
```


### Reading & Resources {#iteration-reading}

Hadley Wickham, Mine Çetinkaya-Rundel, and Garrett Grolemund, [_R for Data Science_ (2nd ed.), "27 Iteration"](https://r4ds.hadley.nz/iteration)

JD Long and Paul Teetor, [R Cookbook, 2nd ed., "Iterating with a Loop"](https://rc2e.com/simpleprogramming#recipe-for_loop), 2019


## Iteration {#iteration-intro}

Like functions, iteration is a way to reduce 

* repetition in your code, and 

* repetition in copy-and-paste tasks.


Iteration:  

> helps you when you need to do the same thing to multiple inputs: repeating the same operation on different columns, or on different datasets. ([_R4DS_, "21 Iteration"](https://r4ds.had.co.nz/iteration.html))

A rule to follow:

**Never copy and paste more than twice.**


A simple example, where we create a dataframe (tibble) named "df" with 4 columns of 50 records each, each one with a normal distribution centred on 100. (This is a variation on what's in the "Iteration" chapter of _R4DS_.)


```{r iteration_df}
set.seed(8675309)

df <- tibble(
  a = rnorm(50, mean = 100, sd = 10),
  b = rnorm(50, mean = 100, sd = 10),
  c = rnorm(50, mean = 100, sd = 10),
  d = rnorm(50, mean = 100, sd = 10)
)
  
df

```

Calculate the median of variable `a`:

```{r}
median(df$a)

```

Calculate the median of the other three variables:

```{r}
# solution
median(df$b)
median(df$c)
median(df$d)

```

## For-Loop {#iteration-loop}

Or rather than repeat the same function four times, you can write a "for-loop", which executes the code within the loop for as many times as is specified.

There are three parts in a for-loop:

1. Define the output object before you start the loop—this makes your loop more efficient. Notice that the code below uses `ncol()` to count the number of columns in our dataframe "df"

2. The sequence, which defines the number of times the loop iterates

3. The body of the code.

```{r}
output <- vector(mode = "double", length = ncol(df))  # 1. output

# the loop
# - "i" is the counter that 
for (i in seq_along(df)) {            # 2. sequence
  output[[i]] <- median(df[[i]])      # 3. body
}

# print the output object
output

```


## Detour: square brackets! {#iteration-squarebrackets}

These are _accessors_: a method of accessing data by defining the location of that data in the object.

Square brackets inside square brackets is the base R method of referencing a particular location in a list.


One square bracket: the first list in the object

*  think of this as a subset of the object

```{r}

df[1]
```

Two square brackets: the contents of the first object

```{r}
df[[1]]
```

Two values inside two square brackets: this is a single value

* [[row, column]]

```{r}
df[[50, 1]]

```

For a good explanation of R's accessors and the application in selecting vector and matrix elements, see 

* ["\[, \[\[, $: R accessors explained"](https://www.r-bloggers.com/r-accessors-explained/)) by Christopher Brown

* [_R Cookbook_, 2nd ed., "2.9 Selecting Vector Elements"](https://rc2e.com/somebasics#recipe-id039) 


## Data frame iteration {#iteration-dataframe}

Let's calculate the mean of every column in the `mtcars` data frame:

```{r}

mtcars

```


**Advice from _R4DS_:**

> Think about the output, sequence, and body **before** you start writing the loop.


* Note that we can use the `names()` function to pull the variable names (column names) from the data table.

```{r}
# create the output vector
output <- vector("double", ncol(mtcars))

# the names() function 
names(output)
# assign the names to the `output` vector
names(output) <- names(mtcars)

# the loop
for (i in names(mtcars)) {
  output[i] <- mean(mtcars[[i]])
}

# print the output
output

```


## A very practical case {#iteration-practical}

(This is modified from ["21.3.5 Exercises"](https://r4ds.had.co.nz/iteration.html#exercises-56) in _R for Data Science_, which uses CSV files.)


Imagine you have a directory full of Excel files that you want to read and then combine into a single dataframe. This circumstance is encountered in many data analysis situations—perhaps the finance department of your company produces a monthly report with the data from a single month, but your interest is to compare many months of data over a few years. The challenge is to merge the data into a single R object containing all of the available data.

For this example, we have three files that are identically structured. Because we now know how to write a loop, we can write some code that will read each file, and then merge them together.

The steps are: 

1. Create a list with the names of the files of interest (that is, the Excel files) in the sub-folder "data_monthly". The function `dir()` returns a character vector of the names of the files (or directories) in the named directory.

In the code below, note the use of a regular expression `"\\.xls*"` to define the pattern for any Excel file, including the wild card "*" which identifies both the older ".xls" extension or the newer ".xlsx".

```{r}

all_files <- dir("data_monthly/", pattern = "\\.xls*", full.names = TRUE)
all_files

```

2. With the `length()` function, we can find out how many files there are, and can use that to assign our object:

```{r}

df_list <- vector("list", length(all_files))

```

The `all_files` and `df_list` objects are used in our loop:

```{r}
for (i in seq_along(all_files)) {
  df_list[[i]] <- readxl::read_excel(all_files[[i]])
}
```

This creates an interesting R object—it's a list, but a list of tibbles, each one with the contents of a single Excel file. This structure is sometimes referred to as "nested"—the tibbles are _nested_ inside the list.

Let's take a look:

```{r}
df_list
```

The last step is to use the function `bind_rows()` (from {dplyr}) to combine the list of data frames into a single data frame.

```{r}

df <- dplyr::bind_rows(df_list)

df 

```


#### An alternative

In this approach, we create the output object to be a list with the file names:

```{r}
# this is the same
df2_list <- vector("list", length(all_files))
# assign the file names
names(df2_list) <- all_files

# the loop
for (fname in all_files) {
  df2_list[[fname]] <- readxl::read_excel(fname)
}


df2 <- bind_rows(df2_list)
df2
```


-30-