-
Notifications
You must be signed in to change notification settings - Fork 1
/
Copy pathanova_basic.R
145 lines (112 loc) · 6.6 KB
/
anova_basic.R
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
##########################################################################################
# Anova
# https://en.wikipedia.org/wiki/Analysis_of_variance
#
# Analysis of variance (ANOVA) is a collection of statistical models and their associated estimation procedures
# (such as the "variation" among and between groups) used to analyze the differences among group means in a sample.
# ANOVA was developed by statistician and evolutionary biologist Ronald Fisher. In the ANOVA setting, the observed variance
#in a particular variable is partitioned into components attributable to different sources of variation.
# In its simplest form, ANOVA provides a statistical test of whether the population means of several groups are equal,
# and therefore generalizes the t-test to more than two groups. ANOVA is useful for comparing (testing) three or more group
# means for statistical significance. It is conceptually similar to multiple two-sample t-tests, but is more conservative,
#resulting in fewer type I errors,[1] and is therefore suited to a wide range of practical problems.
#
# from http://homepages.inf.ed.ac.uk/bwebb/statistics/ANOVA_in_R.pdf
##########################################################################################
##########################################################
# Load libraries
##########################################################
library(lme4)
library(sqldf)
##########################################################
# Load data
##########################################################
attach(InsectSprays)
data("InsectSprays")
str(InsectSprays)
head(InsectSprays)
summary(InsectSprays)
##########################################################
# Plot it
##########################################################
boxplot(count ~ spray, data = InsectSprays)
##########################################################
# One-way ANOVA
##########################################################
oneway.test(formula = count ~ spray, data = InsectSprays)
##########################################################
# ANOVA
##########################################################
aov_out = aov(formula = count ~ spray, data = InsectSprays)
summary(aov_out)
# ANOVA and linear regression are the same thing – more on that tomorrow. For the
# moment, the main point to note is that you can look at the results from aov() in terms of the
# linear regression that was carried out, i.e. you can see the parameters that were estimated.
# Implicitly this can be understood as a set of (non-orthogonal) contrasts of the first group
# against each of the other groups. R uses these so-called ‘Treatment’ contrasts as the default,
# but you can request alternative contrasts (see later)
summary.lm(aov_out)
# Interpret as treatment contrasts compred to control (see PDF)
#########################################################
# Plot residuals
#########################################################
plot(aov_out)
# 1. This shows if there is a pattern in the residuals, and ideally should show similar scatter for
# each condition. Here there is a worrying effect of larger residuals for larger fitted values. This
# is called ‘heteroscedascity’ meaning that not only is variance in the response not equal across
# groups, but that the variance has some specific relationship with the size of the response. In
# fact you could see this in the original boxplots. It contradicts assumptions made when doing
# an ANOVA.
# 2. Q-Q plots
# This looks for normality of the residuals; if they are not normal, the assumptions of ANOVA
# are potentially violated.
# 3. This is like the first plot but now to specifically test if the residuals increase with the fitted
# values, which they do
####################################################################################
# Contrasts
# taken from
# http://homepages.inf.ed.ac.uk/bwebb/statistics/Statistical_models.html
####################################################################################
# In general, the F-ratio its p-value for an ANOVA tell us that the groups differ,
# but not which groups differ. In some experimental designs we might have particular
# expectations about which group differences will be meaningful.
# First, we need to specify contrasts attribute for our predictor variable (spray type).
# We can use one of the default contrast matrices , e.g. contr.helmert() is an orthogonal
# contrast matrix where each category (except the last) is compared to the mean effect
# of all subsequent categories a standard procedure for studies along the lines of control
# vs. drug A vs. drug B.
contrasts(InsectSprays$spray)
contr.helmert(InsectSprays$spray)
# if we want to define our own contrasts it is possible too.
# Let's say that first we want to compare first 3 conditions to the last 3 and
# then run more contrasts within those two subgroups of conditions.
# First, we define the weights
con1=c(1,1,1,-1,-1,-1)
con2=c(2,-1,-1,0,0,0)
con3=c(0,1,-1,0,0,0)
con4=c(0,0,0,2,-1,-1)
con5=c(0,0,0,0,1,-1)
our.contrasts=cbind(con1,con2,con3,con4,con5)
our.contrasts
# Then we attach contrasts attribute
contrasts(InsectSprays$spray)=our.contrasts
# Finally, once we obtain our model with aov() function we use summary.lm()
# instead of standard summary() to observe the results of the contrasts.
model_aov_newcontrasts = aov(count~spray, data=InsectSprays)
summary.lm(model_aov_newcontrasts)
#############################################################
# Post-hoc tests
#############################################################
# A common approach alternative approach to planned contrasts is simply to carry out a set of t-tests between various groups,
# potentially whatever looks interesting in the data (hence the name, post-hoc). It is necessary to control for the overall
# type I error rate, α, given the number of comparisons, c. The simplest approach is using the simple Bonferonni correction
# set the significance level at α/c, e.g. here, if comparing each group against the first group we would be doing 5 comparisons,
# so would need each t-test to have a p<.01 to be significant to ensure overall α<0.05. More precise is to use Dunn-Sidak correction:
# There are various other methods for multiple comparisons, each of which makes slightly different assumptions in attempting to control
# for both type I and type II error (power), including:
# Neuman-Keuls (liberal on Type I);
# Scheff(strict on Type I but bad for Type II);
# Dunnett� (optimised for comparing all treatments to a single control);
# Games-Howell (doesn� sume equal variances); and
# Tukey HSD (optimised for when all possible pairwise comparisons are being made)
TukeyHSD(aov_out)