tutorial1-introduction.Rmd

---
  title: "R Growth Curve Analysis & Eyetracking Workshop: Tutorial 1: Introduction to R"
  author: "Brock Ferguson"
  date: "July 1, 2015"
  output:
    html_document:
      toc: true
      theme: readable
---

# Dependencies
  
Install the packages we'll be using today, and load ggplot -- which we'll use in this first tutorial.

```{r quiet=T}
# commented out because I already have these
# install.packages('reshape2','dplyr','ggplot2','lme4','lmerTest')

library(ggplot2)
```

# Working with dataframes

We are going to use dataset from Wordbank (http://wordbank.stanford.edu/) for these
early tutorials. This dataset tracks boys' and girls' productive vocabularies
through toddlerhood. To make things easier, I have aggregated the dataset across
subjects to yield a median vocabulary size by gender and age, with confidence intervals.

```{r}
vocab <- read.csv('data-vocab.csv')
```

This dataset is loaded as a "dataframe" -- a series of columns.
You can see this dataframe now in our "Environment" window if you are using RStudio.

Each column in a dataframe can be of a unique type.

Let's see what types of columns we have:

```{r}
summary(vocab)
```

We don't need the X column (it's just indicates the row number) so we will get rid of it
by selecting the columns of the dataframe using a vector representing the range 2 to 7.

```{r}
vocab <- vocab[, c(2:7)]
summary(vocab)
```

We can select one column and assign it to a new vector variable.

```{r}
medians <- vocab[, 'median']
```

This vector is now listed in the Environment as well -- under "values".

We can also select rows by number. Because we haven't assigned their values to a new variable, they will just be printed to the screen.

```{r}
vocab[c(1:5), ]
vocab[c(1,2,3,4,5), ]
```

... Or combine row and column selections...

```{r}
vocab[5, 'gender']
vocab[c(1:5), c('age','gender')]
```

Want to search through a dataframe for matching columns? We can do that too.

```{r}
vocab[which(vocab$gender == 'F'), ]
vocab[which(vocab$gender == 'F' & vocab$age == 17), ]
vocab[which(vocab$age == 18 | vocab$age == 17), ]
vocab[which(vocab$age > 16 & vocab$age < 19), ]
```

Using assignment, we can overwrite values in a dataframe.
For example, let's set the median vocab for 17-month-old females to 1000.

```{r}
vocab[which(vocab$gender == 'F' & vocab$age == 17), 'median'] <- 1000
vocab[which(vocab$gender == 'F' & vocab$age == 17), ] # now shows 1000

vocab[which(vocab$gender == 'F' & vocab$age == 17), c('median','ci.h')] <- c(1000,NA)

# reset it now...
vocab[which(vocab$gender == 'F' & vocab$age == 17), c('median','ci.h')] <- c(75,37)
```

# Vectors

R is a vectorized language, meaning that we can perform operations on entire vectors
in parallel and not have to operate on each cell one-at-a-time.

Take our vector of medians, for example:

```{r}
medians
```

Let's perform some mathematical operations on it:

```{r}
medians + 2
medians*4
(medians^2) / 4
log(medians)
exp(medians)
sqrt(medians)
```

Those vectorized functions return new vectors that have been transformed in some way.

We can also use summary functions to boil our medians vector down to a single meaningful metric:

```{r}
mean(medians)
var(medians)
sqrt(var(medians))
sd(medians)
length(medians)

length(vocab)
nrow(vocab)
ncol(vocab)
```

Or combine summary functions and vector transformations to do useful things, like
calculate z-scores:

```{r}
z_scores <- (medians - mean(medians)) / sd(medians)
z_scores

# compare to R's functions for z-score scaling...
z_scores <- scale(medians,center=T,scale=T)
```

Almost all functions can apply to vectors, even non-mathematical functions:

For example, `paste()` combines strings together:

```{r}
paste('this','is','a','sentence',sep=' ')
paste('median: ',medians,sep='')
```

And, of course, we can perform operations involving multiple vectors:

```{r}
# vector * integer
medians * rnorm(1, 0, 1)

# vector * vector
medians * rnorm(length(medians), 0, 1)
```

# Visualization

R includes basic functions for visualizing your data:

```{r}
plot(medians)
hist(medians)
plot(vocab$age, vocab$median)
```

...but I prefer using ggplot2 which gives us more intuitive options for visualization
and greater control over the appearance:

```{r}
ggplot(vocab, aes(x=age, y=median, color=gender)) +
        geom_point()

ggplot(vocab, aes(x=age, y=median, color=gender)) +
        geom_pointrange(aes(min=median-ci.l, max=median+ci.h), position=position_dodge(.4)) +
        labs(x="Age (Months)", y="Productive Vocabulary")

ggplot(vocab, aes(x=age, y=median, fill=gender)) +
        geom_bar(stat='identity', position=position_dodge()) +
        geom_errorbar(aes(ymax=median+ci.h, ymin=median-ci.l), position=position_dodge(.9), width=0.25) +
        labs(x="Age (Months)", y="Productive Vocabulary")
```

We'll gradually use more and more advanced aspects of ggplot throughout these tutorials. For now, we are simply plotting single variables from a dataframe (much like we could with Excel). However, by the 4th tutorial, we'll be using ggplot to group, transform, and summarize our (almost) raw data and visualize it in a single shot.

# Statistical tests and models

Finally, R provides a host of functions to run statistical tests and/or fit models to your data.

Let's do a basic correlation just to see how these functions work.

First, get just the female data:

```{r}
female_vocab <- vocab[which(vocab$gender == 'F'), c('age','median')]
female_vocab 
```
                  
## Does age correlate with productive vocabulary?

Plot using basic R plots.

```{r}
plot(female_vocab$age, female_vocab$median)

# add a line of best fit
abline(lm(female_vocab$median ~ female_vocab$age))
```

We'll get more into the syntax of `lm()` later, but this function uses a linear model
to predict the 'median' vector from the 'age' vector, then feeds this to `abline()`
to plot this line of best fit.

We can do the same thing with ggplot, though it is nice enough to give us the SE around our linear model fit:

```{r}
ggplot(female_vocab, aes(x=age, y=median)) +
                                  geom_point() +
                                  stat_smooth(method="lm")
```

What is the Pearson's r for this correlation?

```{r}
cor(female_vocab$age, female_vocab$median)
```

Is this significant?

```{r}
cor.test(female_vocab$age, female_vocab$median)
```

We can access specific variables in this statistical test's output directly:

```{r}
cor_output <- cor.test(female_vocab$age, female_vocab$median)

# useful command to show the available data in any variable
str(cor_output)

cor_output$p.value
cor_output$statistic

# also:
names(cor_output)
cor_output$conf.int

# round the p-value:
cor_output$p.value
round(cor_output$p.value, 3)

# how much variance does this account for (e.g., R-squared, R^2)?
cor_output$estimate^2
```

When we run statistical models, the vector we return is not only a collection of variables
(e.g., from which we can select the p-value, estimate, etc.) but also an *object*
on which we can run various *class* methods.

```{r}
class(cor_output)
```

Here we can see we have an "htest" (i.e., "hypothesis test") object.

This class is limited to the "print" method.

```{r}
print(cor_output)
```

... but other classes include other methods like 'plot', 'summary', etc.

Let's run another statistical test on some random data. A Welch's t-test:

```{r}
t.test(rnorm(30,5,5), rnorm(30,10,5))
```

We can pass specific options to these tests. One common option for 't.test' is var.equal=T so that we run a standard Student t-test rather than a Welch's t-test.

```{r}
t.test(rnorm(30,5,5), rnorm(30,10,5), var.equal=T)
```

We can also use `paired` to run a dependent, paired t-test, or use `alt` to specify whether we want a
one-sided or two-sided p-value.

```{r}
t.test(rnorm(30,5,5), rnorm(30,10,5), paired=T)
t.test(rnorm(30,5,5), rnorm(30,10,5), alt='less',var.equal=T) # by default: 'two.sided'
```

What class is a `t.test()` object?

```{r}
class(t.test(rnorm(30,5,5), rnorm(30,10,5), var.equal=T))
```

It's the same as `cor.test()`, which is why the output is similar.

Finally, we loaded our data from a CSV, now let's save a dataset to a CSV file.

```{r}
write.csv(female_vocab, 'data-vocab-females.csv')
```

Clean up our workspace.

```{r}
ls()
rm(list=ls())
```