-
Notifications
You must be signed in to change notification settings - Fork 6
/
Copy pathtutorial1-introduction.Rmd
321 lines (226 loc) · 8.12 KB
/
tutorial1-introduction.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
---
title: "R Growth Curve Analysis & Eyetracking Workshop: Tutorial 1: Introduction to R"
author: "Brock Ferguson"
date: "July 1, 2015"
output:
html_document:
toc: true
theme: readable
---
# Dependencies
Install the packages we'll be using today, and load ggplot -- which we'll use in this first tutorial.
```{r quiet=T}
# commented out because I already have these
# install.packages('reshape2','dplyr','ggplot2','lme4','lmerTest')
library(ggplot2)
```
# Working with dataframes
We are going to use dataset from Wordbank (http://wordbank.stanford.edu/) for these
early tutorials. This dataset tracks boys' and girls' productive vocabularies
through toddlerhood. To make things easier, I have aggregated the dataset across
subjects to yield a median vocabulary size by gender and age, with confidence intervals.
```{r}
vocab <- read.csv('data-vocab.csv')
```
This dataset is loaded as a "dataframe" -- a series of columns.
You can see this dataframe now in our "Environment" window if you are using RStudio.
Each column in a dataframe can be of a unique type.
Let's see what types of columns we have:
```{r}
summary(vocab)
```
We don't need the X column (it's just indicates the row number) so we will get rid of it
by selecting the columns of the dataframe using a vector representing the range 2 to 7.
```{r}
vocab <- vocab[, c(2:7)]
summary(vocab)
```
We can select one column and assign it to a new vector variable.
```{r}
medians <- vocab[, 'median']
```
This vector is now listed in the Environment as well -- under "values".
We can also select rows by number. Because we haven't assigned their values to a new variable, they will just be printed to the screen.
```{r}
vocab[c(1:5), ]
vocab[c(1,2,3,4,5), ]
```
... Or combine row and column selections...
```{r}
vocab[5, 'gender']
vocab[c(1:5), c('age','gender')]
```
Want to search through a dataframe for matching columns? We can do that too.
```{r}
vocab[which(vocab$gender == 'F'), ]
vocab[which(vocab$gender == 'F' & vocab$age == 17), ]
vocab[which(vocab$age == 18 | vocab$age == 17), ]
vocab[which(vocab$age > 16 & vocab$age < 19), ]
```
Using assignment, we can overwrite values in a dataframe.
For example, let's set the median vocab for 17-month-old females to 1000.
```{r}
vocab[which(vocab$gender == 'F' & vocab$age == 17), 'median'] <- 1000
vocab[which(vocab$gender == 'F' & vocab$age == 17), ] # now shows 1000
vocab[which(vocab$gender == 'F' & vocab$age == 17), c('median','ci.h')] <- c(1000,NA)
# reset it now...
vocab[which(vocab$gender == 'F' & vocab$age == 17), c('median','ci.h')] <- c(75,37)
```
# Vectors
R is a vectorized language, meaning that we can perform operations on entire vectors
in parallel and not have to operate on each cell one-at-a-time.
Take our vector of medians, for example:
```{r}
medians
```
Let's perform some mathematical operations on it:
```{r}
medians + 2
medians*4
(medians^2) / 4
log(medians)
exp(medians)
sqrt(medians)
```
Those vectorized functions return new vectors that have been transformed in some way.
We can also use summary functions to boil our medians vector down to a single meaningful metric:
```{r}
mean(medians)
var(medians)
sqrt(var(medians))
sd(medians)
length(medians)
length(vocab)
nrow(vocab)
ncol(vocab)
```
Or combine summary functions and vector transformations to do useful things, like
calculate z-scores:
```{r}
z_scores <- (medians - mean(medians)) / sd(medians)
z_scores
# compare to R's functions for z-score scaling...
z_scores <- scale(medians,center=T,scale=T)
```
Almost all functions can apply to vectors, even non-mathematical functions:
For example, `paste()` combines strings together:
```{r}
paste('this','is','a','sentence',sep=' ')
paste('median: ',medians,sep='')
```
And, of course, we can perform operations involving multiple vectors:
```{r}
# vector * integer
medians * rnorm(1, 0, 1)
# vector * vector
medians * rnorm(length(medians), 0, 1)
```
# Visualization
R includes basic functions for visualizing your data:
```{r}
plot(medians)
hist(medians)
plot(vocab$age, vocab$median)
```
...but I prefer using ggplot2 which gives us more intuitive options for visualization
and greater control over the appearance:
```{r}
ggplot(vocab, aes(x=age, y=median, color=gender)) +
geom_point()
ggplot(vocab, aes(x=age, y=median, color=gender)) +
geom_pointrange(aes(min=median-ci.l, max=median+ci.h), position=position_dodge(.4)) +
labs(x="Age (Months)", y="Productive Vocabulary")
ggplot(vocab, aes(x=age, y=median, fill=gender)) +
geom_bar(stat='identity', position=position_dodge()) +
geom_errorbar(aes(ymax=median+ci.h, ymin=median-ci.l), position=position_dodge(.9), width=0.25) +
labs(x="Age (Months)", y="Productive Vocabulary")
```
We'll gradually use more and more advanced aspects of ggplot throughout these tutorials. For now, we are simply plotting single variables from a dataframe (much like we could with Excel). However, by the 4th tutorial, we'll be using ggplot to group, transform, and summarize our (almost) raw data and visualize it in a single shot.
# Statistical tests and models
Finally, R provides a host of functions to run statistical tests and/or fit models to your data.
Let's do a basic correlation just to see how these functions work.
First, get just the female data:
```{r}
female_vocab <- vocab[which(vocab$gender == 'F'), c('age','median')]
female_vocab
```
## Does age correlate with productive vocabulary?
Plot using basic R plots.
```{r}
plot(female_vocab$age, female_vocab$median)
# add a line of best fit
abline(lm(female_vocab$median ~ female_vocab$age))
```
We'll get more into the syntax of `lm()` later, but this function uses a linear model
to predict the 'median' vector from the 'age' vector, then feeds this to `abline()`
to plot this line of best fit.
We can do the same thing with ggplot, though it is nice enough to give us the SE around our linear model fit:
```{r}
ggplot(female_vocab, aes(x=age, y=median)) +
geom_point() +
stat_smooth(method="lm")
```
What is the Pearson's r for this correlation?
```{r}
cor(female_vocab$age, female_vocab$median)
```
Is this significant?
```{r}
cor.test(female_vocab$age, female_vocab$median)
```
We can access specific variables in this statistical test's output directly:
```{r}
cor_output <- cor.test(female_vocab$age, female_vocab$median)
# useful command to show the available data in any variable
str(cor_output)
cor_output$p.value
cor_output$statistic
# also:
names(cor_output)
cor_output$conf.int
# round the p-value:
cor_output$p.value
round(cor_output$p.value, 3)
# how much variance does this account for (e.g., R-squared, R^2)?
cor_output$estimate^2
```
When we run statistical models, the vector we return is not only a collection of variables
(e.g., from which we can select the p-value, estimate, etc.) but also an *object*
on which we can run various *class* methods.
```{r}
class(cor_output)
```
Here we can see we have an "htest" (i.e., "hypothesis test") object.
This class is limited to the "print" method.
```{r}
print(cor_output)
```
... but other classes include other methods like 'plot', 'summary', etc.
Let's run another statistical test on some random data. A Welch's t-test:
```{r}
t.test(rnorm(30,5,5), rnorm(30,10,5))
```
We can pass specific options to these tests. One common option for 't.test' is var.equal=T so that we run a standard Student t-test rather than a Welch's t-test.
```{r}
t.test(rnorm(30,5,5), rnorm(30,10,5), var.equal=T)
```
We can also use `paired` to run a dependent, paired t-test, or use `alt` to specify whether we want a
one-sided or two-sided p-value.
```{r}
t.test(rnorm(30,5,5), rnorm(30,10,5), paired=T)
t.test(rnorm(30,5,5), rnorm(30,10,5), alt='less',var.equal=T) # by default: 'two.sided'
```
What class is a `t.test()` object?
```{r}
class(t.test(rnorm(30,5,5), rnorm(30,10,5), var.equal=T))
```
It's the same as `cor.test()`, which is why the output is similar.
Finally, we loaded our data from a CSV, now let's save a dataset to a CSV file.
```{r}
write.csv(female_vocab, 'data-vocab-females.csv')
```
Clean up our workspace.
```{r}
ls()
rm(list=ls())
```