dataManagement.Rmd

---
title: "Data Management"
output:
  html_document:
    code_folding: show
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  echo = TRUE,
  error = TRUE,
  comment = "",
  class.source = "fold-show")
```

# Import Data {#importData}

Importing data takes syntax of the following form for `.csv` files:

```{r, eval = FALSE}
data <- read.csv("filepath/filename.csv")
```

Note: it is important to use forward slashes ("/") rather than backslashes ("\\") when specifying filepaths in `R`.

Below, I import a `.csv` file and save it into an object called `mydata` (you could call this object whatever you want):

```{r, eval = FALSE}
mydata <- read.csv("https://osf.io/s6wrm/download")
```

```{r, include = FALSE}
mydata <- read.csv("./data/titanic.csv") #https://osf.io/s6wrm/download
```

Importing data takes syntax of the following form for `.RData` files:

```{r, eval = FALSE}
load("filepath/filename.RData")
```

## Import Multiple Data Files {#importMultipleDataFiles}

```{r, eval = FALSE}
dataNames <- paste("data", 1:100, sep = "")
dataFilenames <- paste(dataNames, ".csv", sep = "")
dataFilepaths <- paste("C:/users/username/", dataFilenames, sep = "")

data_list <- lapply(dataFilepaths, read.csv) # lapply(dataFilepaths, data.table::fread) is even faster
names(data_list) <- basename(dataFilepaths)
```

Alternatively, if you want to load all `.csv` files in a directory, you can identify the filenames programmatically:

```{r, eval = FALSE}
dataFilenames <- list.files(
  path = "C:/users/username/",
  pattern = "\\.csv$")

dataFilepaths <- list.files(
  path = "C:/users/username/",
  pattern = "\\.csv$",
  full.names = TRUE)

data_list <- lapply(dataFilepaths, read.csv) # lapply(dataFilepaths, data.table::fread) is even faster
names(data_list) <- basename(dataFilepaths)
```

# Save Data {#saveData}

Saving data takes syntax of the following form for `.csv` files:

```{r, eval = FALSE}
write.csv(object, file = "filepath/filename.csv")
```

For example:

```{r, eval = FALSE}
write.csv(mydata, file = "mydata.csv")
```

Saving data takes syntax of the following form for `.RData` files:

```{r, eval = FALSE}
save(object, file = "filepath/filename.RData")
```

# Use Lab Functions {#labFunctions}

To use lab functions, first install the [`petersenlab` package](https://devpsylab.github.io/petersenlab).
The [`petersenlab` package](https://devpsylab.github.io/petersenlab) is here: https://devpsylab.github.io/petersenlab.
You can install it using the following commands:

```{r, eval = FALSE}
install.packages("remotes")
remotes::install_github("DevPsyLab/petersenlab")
```

Once you have the [`petersenlab` package](https://devpsylab.github.io/petersenlab) installed, load the package:

```{r}
library("petersenlab")
```

To run scripts on the lab drive, set the path to the lab drive (`//lc-rs-store24.hpc.uiowa.edu/lss_itpetersen/Lab/`) using the following code:

```{r, eval = FALSE}
petersenLabPath <- setLabPath()
```

# Load/Install Packages {#loadInstallPackages}

To install a single package that is on the CRAN repository, use the following syntax:

```{r, eval = FALSE}
install.packages("name_of_package")
```

To install multiple packages that are on the CRAN repository, use the following syntax:

```{r, eval = FALSE}
install.packages(c("name_of_package1","name_of_package2","name_of_package3"))
```

To install a package that is on a `GitHub` repository, use the following syntax:

```{r, eval = FALSE}
install.packages("remotes")
remotes::install_github("username_of_GitHub_author/name_of_package")
```

For instance:

```{r, eval = FALSE}
remotes::install_github("DevPsyLab/petersenlab")
```

The default way to load a package in `R` is:

```{r, eval = FALSE}
library("packageName1")
library("packageName2")
library("packageName3")
```

However, when sourcing (i.e., running) other `R` scripts, it is possible that you will run scripts that use packages that you do not have installed, resulting in an error that prevents the script from running.
Thus, it can be safer to load packages using the lab function, `load_or_install(),` rather than using `library()`.
The `load_or_install()` function checks whether a package is installed.
If the package is not installed, the function installs and loads the package.
If the package is installed, the function loads the package.
To use this function, you must have the [`petersenlab` package](https://devpsylab.github.io/petersenlab) loaded.

```{r, eval = FALSE}
library("petersenlab")
load_or_install(c("packageName1","packageName2","packageName3"))
```

For example:

```{r}
library("petersenlab")
load_or_install(c("tidyverse","psych"))
```

# Set a Seed {#seed}

Set a seed (any number) to reproduce the results of analyses that involve random number generation.

```{r}
set.seed(52242)
```

# Run an `R` Script {#source}

To run an `R` script, use the following syntax:

```{r, eval = FALSE}
source("filepath/filename.R")
```

# Render an `R` Markdown (`.Rmd`) File {#renderRmd}

To render a `.Rmd` file, use the following syntax:

```{r, eval = FALSE}
render("filepath/filename.Rmd")
```

# Variable Names {#varNames}

To look at the names of variables in a data frame, use the following syntax:

```{r}
names(mydata)
```

# Logical Operators {#logicalOperators}

Logical operators evaluate a condition for each value and yield values of `TRUE` and `FALSE`, corresponding to whether the evaluation for a given value met the condition.

## Is Equal To: `==`

```{r}
mydata$survived == 1
```

## Is Not Equal To: `!=`

```{r}
mydata$survived != 1
```

### Greater Than: `>`

```{r}
mydata$parch > 1
```

## Less Than: `<`

```{r}
mydata$parch < 1
```

## Greater Than or Equal To: `>=`

```{r}
mydata$parch >= 1
```

## Less Than or Equal To: `<=`

```{r}
mydata$parch <= 1
```
## Is in a Value of Another Vector: `%in%`

```{r}
anotherVector <- c(0,1)
mydata$parch %in% anotherVector
```

## Is Not in a Value of Another Vector: `%ni%`

Note: this function is part of the [`petersenlab`](https://github.com/DevPsyLab/petersenlab) package and is not available in base `R`.

```{r}
mydata$parch %ni% anotherVector
```

## Is Missing: `is.na()`

```{r}
is.na(mydata$prediction)
```

## Is Not Missing: `!is.na()`

```{r}
!is.na(mydata$prediction)
```

## And: `&`

```{r}
!is.na(mydata$prediction) & mydata$parch >= 1
```

## Or: `|`

```{r}
is.na(mydata$prediction) | mydata$parch >= 1
```

# Subset {#subset}

To subset a data frame, use brackets to specify the subset of rows and columns to keep, where the value/vector before the comma specifies the rows to keep, and the value/vector after the comma specifies the columns to keep:

```{r, eval = FALSE}
dataframe[rowsToKeep, columnsToKeep]
```

You can subset by using any of the following:

- numeric indices of the rows/columns to keep (or drop)
- names of the rows/columns to keep (or drop)
- values of `TRUE` and `FALSE` corresponding to which rows/columns to keep

## One Variable

To subset one variable, use the following syntax:

```{r}
mydata$age
```

or:

```{r}
mydata[,"age"]
```

## Particular Rows of One Variable

To subset one variable, use the following syntax:

```{r}
mydata$age[which(mydata$survived == 1)]
```

or:

```{r}
mydata[which(mydata$survived == 1), "age"]
```

## Particular Columns (Variables)

To subset particular columns/variables, use the following syntax:

### Base `R`

```{r}
subsetVars <- c("survived","age","prediction")

mydata[,c(1,2,3)]
mydata[,c("survived","age","prediction")]
mydata[,subsetVars]
```

Or, to drop columns:

```{r}
dropVars <- c("sibsp","parch")

mydata[,-c(5,6)]
mydata[,names(mydata) %ni% c("sibsp","parch")]
mydata[,names(mydata) %ni% dropVars]
```

### Tidyverse

```{r}
mydata %>%
  select(survived, age, prediction)

mydata %>%
  select(survived:prediction)

mydata %>%
  select(all_of(subsetVars))
```

Or, to drop columns:

```{r}
mydata %>%
  select(-sibsp, -parch)

mydata %>%
  select(-c(sibsp:parch))

mydata %>%
  select(-all_of(dropVars))
```

## Particular Rows

To subset particular rows, use the following syntax:

### Base `R`

```{r}
subsetRows <- c(1,3,5)

mydata[c(1,3,5),]
mydata[subsetRows,]
mydata[which(mydata$survived == 1),]
```

### Tidyverse

```{r}
mydata %>%
  filter(survived == 1)

mydata %>%
  filter(survived == 1, parch <= 1)

mydata %>%
  filter(survived == 1 | parch <= 1)
```

## Particular Rows and Columns

To subset particular rows and columns, use the following syntax:

### Base `R`

```{r}
mydata[c(1,3,5), c(1,2,3)]
mydata[subsetRows, subsetVars]
mydata[which(mydata$survived == 1), subsetVars]
```

### Tidyverse

```{r}
mydata %>%
  filter(survived == 1) %>%
  select(all_of(subsetVars))
```

# View Data {#viewData}

## All Data

To view data, use the following syntax:

```{r, eval = FALSE}
View(mydata)
```

## First 6 Rows/Elements

To view only the first six rows (if a data frame) or elements (if a vector), use the following syntax:

```{r}
head(mydata)
head(mydata$age)
```

# Data Characteristics {#dataCharacteristics}

## Data Structure

```{r}
str(mydata)
```

## Data Dimensions

Number of rows and columns:

```{r}
dim(mydata)
```

## Number of Elements

```{r}
length(mydata$age)
```

## Number of Missing Elements

```{r}
length(mydata$age[which(is.na(mydata$age))])
```

## Number of Non-Missing Elements

```{r}
length(mydata$age[which(!is.na(mydata$age))])
length(na.omit(mydata$age))
```

# Create New Variables {#createNewVars}

To create a new variable, use the following syntax:

```{r}
mydata$newVar <- NA
```

```{r, include = FALSE}
mydata$newVar <- NULL
```

Here is an example of creating a new variable:

```{r}
mydata$ID <- 1:nrow(mydata)
```

# Create a Data Frame {#createDF}

Here is an example of creating a data frame:

```{r}
mydata2 <- data.frame(
  ID = c(1:5, 1047:1051),
  cat = sample(0:1, 10, replace = TRUE)
)

mydata2
```

# Recode Variables {#recodeVars}

Here is an example of recoding a variable:

```{r}
mydata$oldVar1[which(mydata$sex == "male")] <- 0
mydata$oldVar1[which(mydata$sex == "female")] <- 1

mydata$oldVar2[which(mydata$sex == "male")] <- 1
mydata$oldVar2[which(mydata$sex == "female")] <- 0
```

Recode multiple variables:

```{r}
mydata %>%
  mutate(across(c(
    survived:pclass),
    ~ case_match(
      .,
      0 ~ "No",
      1 ~ "Yes")))

mydata %>%
  mutate(across(c(
    survived:pclass),
    ~ case_match(
      .,
      c(0,1) ~ 1,
      c(2,3) ~ 2)))
```

# Rename Variables {#renameVars}

```{r}
mydata <- mydata %>% 
  rename(
    newVar1 = oldVar1,
    newVar2 = oldVar2)
```

Using a vector of variable names:

```{r, eval = FALSE}
varNamesFrom <- c("oldVar1","oldVar2")
varNamesTo <- c("newVar1","newVar2")

mydata <- mydata %>% 
  rename_with(~ varNamesTo, all_of(varNamesFrom))
```

# Convert the Types of Variables {#convertVarTypes}

One variable:

```{r}
mydata$factorVar <- factor(mydata$sex)
mydata$numericVar <- as.numeric(mydata$prediction)
mydata$integerVar <- as.integer(mydata$parch)
mydata$characterVar <- as.character(mydata$sex)
```

Multiple variables:

```{r}
mydata %>%
  mutate(across(c(
    age,
    parch,
    prediction),
    as.numeric))

mydata %>%
  mutate(across(
    age:parch,
    as.numeric))

mydata %>%
  mutate(across(where(is.factor), as.character))
```

# Merging/Joins {#merging}

## Overview

Merging (also called joining) merges two data objects using a shared set of variables called "keys."
The keys are the variable(s) that uniquely identify each row (i.e., they account for the levels of nesting).
In some data objects, the key might be the participant's ID (e.g., `participantID`).
However, some data objects have multiple keys.
For instance, in long form data objects, each participant may have multiple rows corresponding to multiple timepoints.
In this case, the keys are `participantID` and `timepoint`.
If a participant has multiple rows corresponding to timepoints and measures, the keys are `participantID`, `timepoint`, and `measure`.
In general, each row should have a value on each of the keys; there should be no missingness in the keys.

To merge two objects, the keys must be present in both objects.
The keys are used to merge the variables in object 1 (`x`) with the variables in object 2 (`y`).
Different merge types select different rows to merge.

Note: if the two objects include variables with the same name (apart from the keys), `R` will not know how you want each to appear in the merged object.
So, it will add a suffix (e.g., `.x`, `.y`) to each common variable to indicate which object (i.e., object `x` or object `y`) the variable came from, where object `x` is the first object—i.e., the object to which object `y` (the second object) is merged.
In general, apart from the keys, you should not include variables with the same name in two objects to be merged.
To prevent this, either remove or rename the shared variable in one of the objects, or include the shared variable as a key.
However, as described above, you should include it as a key ***only*** if it uniquely identifies each row in terms of levels of nesting.

## Data Before Merging

Here are the data in the `mydata` object:

```{r}
mydata

dim(mydata)
```

Here are the data in the `mydata2` object:

```{r}
mydata2

dim(mydata2)
```

## Types of Joins {#mergeTypes}

### Visual Overview of Join Types

Below is a visual that depicts various types of merges/joins.
Object `x` is the circle labeled as `A`.
Object `y` is the circle labeled as `B`.
The area of overlap in the Venn diagram indicates the rows on the keys that are shared between the two objects (e.g., `participantID` values 1, 2, and 3).
The non-overlapping area indicates the rows on the keys that are unique to each object (e.g., `participantID` values 4, 5, and 6 in Object `x` and values 7, 8, and 9 in Object `y`).
The shaded yellow area indicates which rows (on the keys) are kept in the merged object from each of the two objects, when using each of the merge types.
For instance, a left outer join keeps the shared rows and the rows that are unique to object `x`, but it drops the rows that are unique to object `y`.

![Types of merges/joins](images/joins.png)

Image source: [Predictive Hacks](https://predictivehacks.com/?all-tips=anti-joins-with-pandas) (archived at: https://perma.cc/WV7U-BS68)

### Full Outer Join {#fullJoin}

A full outer join includes all rows in $x$ **or** $y$.
It returns columns from $x$ and $y$.
Here is how to merge two data frames using a full outer join (i.e., "full join"):

```{r}
fullJoinData <- merge(mydata, mydata2, by = "ID", all = TRUE)

fullJoinData
dim(fullJoinData)
```

Or, alternatively, using `tidyverse`:

```{r}
full_join(mydata, mydata2, by = "ID")
```

### Left Outer Join {#leftJoin}

A left outer join includes all rows in $x$.
It returns columns from $x$ and $y$.
Here is how to merge two data frames using a left outer join ("left join"):

```{r}
leftJoinData <- merge(mydata, mydata2, by = "ID", all.x = TRUE)

leftJoinData
dim(leftJoinData)
```

Or, alternatively, using `tidyverse`:

```{r}
left_join(mydata, mydata2, by = "ID")
```

### Right Outer Join {#rightJoin}

A right outer join includes all rows in $y$.
It returns columns from $x$ and $y$.
Here is how to merge two data frames using a right outer join ("right join"):

```{r}
rightJoinData <- merge(mydata, mydata2, by = "ID", all.y = TRUE)

rightJoinData
dim(rightJoinData)
```

Or, alternatively, using `tidyverse`:

```{r}
right_join(mydata, mydata2, by = "ID")
```

### Inner Join {#innerJoin}

An inner join includes all rows that are in **both** $x$ **and** $y$.
An inner join will return one row of $x$ for each matching row of $y$, and can duplicate values of records on either side (left or right) if $x$ and $y$ have more than one matching record.
It returns columns from $x$ and $y$.
Here is how to merge two data frames using an inner join:

```{r}
innerJoinData <- merge(mydata, mydata2, by = "ID", all.x = FALSE, all.y = FALSE)

innerJoinData
dim(innerJoinData)
```

Or, alternatively, using `tidyverse`:

```{r}
inner_join(mydata, mydata2, by = "ID")
```

### Semi Join {#semiJoin}

A semi join is a filter.
A left semi join returns all rows from $x$ **with** a match in $y$.
That is, it filters out records from $x$ that are not in $y$.
Unlike an [inner join](#innerJoin), a left semi join will never duplicate rows of $x$, and it includes columns from only $x$ (not from $y$). 
Here is how to merge two data frames using a left semi join:

```{r}
semiJoinData <- semi_join(mydata, mydata2, by = "ID")

semiJoinData
dim(semiJoinData)
```

### Anti Join {#antiJoin}

An anti join is a filter.
A left anti join returns all rows from $x$ **without** a match in $y$.
That is, it filters out records from $x$ that are in $y$.
It returns columns from only $x$ (not from $y$).
Here is how to merge two data frames using a left anti join:

```{r}
antiJoinData <- anti_join(mydata, mydata2, by = "ID")

antiJoinData
dim(antiJoinData)
```

### Cross Join {#crossJoin}

A cross join combines each row in $x$ with each row in $y$.

```{r}
crossJoinData <- cross_join(
  data.frame(rater = c("Mother","Father","Teacher")),
  data.frame(timepoint = 1:3))

crossJoinData
dim(crossJoinData)
```

# Long to Wide {#longToWide}

Original data:

```{r}
fish_encounters
```

Data widened by a variable (`station`), using `tidyverse`: 

```{r}
fish_encounters %>% 
  pivot_wider(
    names_from = station,
    values_from = seen)
```

# Wide to Long {#wideToLong}

Original data:

```{r}
mtcars
```

Data in long form, transformed from wide form using `tidyverse`:

```{r}
mtcars %>% 
  pivot_longer(
    cols = everything(),
    names_to = "variable",
    values_to = "value")
```

# Average Ratings Across Coders {#avgAcrossCoders}

Create data with multiple coders:

```{r, class.source = "fold-hide"}
idWaveCoder <- 
  expand.grid(
    id = 1:100,
    wave = 1:3,
    coder = 1:3,
    positiveAffect = NA,
    negativeAffect = NA
  )

idWaveCoder$positiveAffect <- rnorm(nrow(idWaveCoder))
idWaveCoder$negativeAffect <- rnorm(nrow(idWaveCoder))

idWaveCoder %>% 
  arrange(id, wave, coder)
```

Average data across coders:

```{r}
idWave <- idWaveCoder %>% 
  group_by(id, wave) %>% 
  summarise(
    across(everything(),
      ~ mean(.x, na.rm = TRUE)),
    .groups = "drop") %>% 
  select(-coder)

idWave
```

# Loops {#loops}

If you want to perform the same computation multiple times, it can be faster to do it in a loop compared to writing out the same computation many times.
For instance, here is a loop that prints each element of a vector and the loop index (`i`) that indicates where the loop is in terms of its iterations:

```{r}
fruits <- c("apple", "banana", "cherry")

for(i in 1:length(fruits)){
  print(paste("The loop is at index:", i, sep = " "))
  print(fruits[i])
}
```

# Session Info

```{r, class.source = "fold-hide"}
sessionInfo()
```