-
Notifications
You must be signed in to change notification settings - Fork 36
/
lab2.Rmd
160 lines (122 loc) · 6.64 KB
/
lab2.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
---
title: "In-class Lab 2"
author: "ECON 4223 (Prof. Tyler Ransom, U of Oklahoma)"
date: "January 25, 2022"
output:
html_document:
df_print: paged
pdf_document: default
word_document: default
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, results = 'hide', fig.keep = 'none')
```
The purpose of this lab is to practice using R to conduct hypothesis tests and run a basic OLS regression. The lab may be completed as a group. To get credit, upload your .R script to the appropriate place on Canvas. If done as a group, please list the names of all group members in a comment at the top of the file.
## For starters
Open up a new R script (named `ICL2_XYZ.R`, where `XYZ` are your initials) and add the following to the top:
```{r message=FALSE, warning=FALSE, paged.print=FALSE}
library(tidyverse)
library(modelsummary)
library(broom) # you'll need to install this in the console first
library(wooldridge)
```
Load the dataset `audit` from the `wooldridge` package, like so:
```{r}
df <- as_tibble(audit)
```
## A one-tailed test
The `audit` data set contains three variables: `w`, `b`, and `y`. The variables `b` and `w` respectively denote whether the black or white member of a pair of resumes was offered a job by the same employer. `y` is simply the difference between the two, i.e. `y=b-w`.
We want to test the following hypothesis:
\begin{equation*}
H_0: \mu=0; \\
H_a: \mu<0
\end{equation*}
where $\mu = \theta_{B}-\theta_{W}$, i.e. the difference in respective job offer rates for blacks and whites.
### The `t.test()` function in R
To conduct a t-test in R, simply provide the appropriate information to the `t.test` function.
How do you know what the "appropriate information" is?
- In the RStudio console, type `?t.test` and hit enter.
- A help page should open in the bottom-right of your RStudio screen. The page should say "Student's t-Test"
- Under `Usage` it says `t.test(x, ...)`.
* This means that *at minimum* we only have to provide it with is some object `x`. The `...` signals that we can provide it more than just `x`.
- Under `Arguments` it explains what `x` is: "a (non-empty) numeric vector of data values"
* This means that R is expecting us to pass a column of a data frame to `t.test()`
- The other information in the help explains default settings of `t.test()`. For example:
* `alternative` is `"two.sided"` by default
* `mu` is 0 by default
* ... other options that we won't worry about right now
Now let's do the hypothesis test written above. Add the following code to your script:
```{r}
t.test(df$y,alternative="less")
```
R automatically computes for us the t-statistic using the formula
\[
\frac{\overline{y} - \mu}{SE_{\bar{y}}}
\]
All we had to give R was the sample of data (`y`, in our case) and the null value (0, which is the `t.test` default)!
### Interpreting the output of `t.test()`
Now that we've conducted the t-test, how do we know the result of our hypothesis test? If you run your script, you should see something like
```
> t.test(df$y,alternative="less")
One Sample t-test
data: df$y
t = -4.2768, df = 240, p-value = 1.369e-05
alternative hypothesis: true mean is less than 0
95 percent confidence interval:
-Inf -0.08151529
sample estimates:
mean of x
-0.1327801
```
R reports the value of the t-statistic, how many degrees of freedom, and the p-value associated with the test. R *does not* report the critical value, but the p-value provides the same information.
In this case, our p-value is approximatley 0.00001369, which is much lower than 0.05 (our significance level). Thus, we reject $H_0$.
## A two-tailed test
Now suppose instead we want to test if job offer rates of blacks are *different* from those of whites.
We want to test the following hypothesis:
\[
H_0: \theta_{b}=\theta_{w}; \\
H_a: \theta_{b}\neq\theta_{w}
\]
This hypothesis test considers the case where there might be *reverse discrimination* (e.g. through affirmative action policies).
The code to conduct this test is similar to the code we used previously. (Add the following code to your script:)
```{r}
t.test(df$b,df$w,alternative="two.sided",paired=TRUE)
```
You'll notice that the t-statistic is the exact same (-4.2768) for both of the tests. But the p-value for the two-tailed test is twice as large (0.00002739). This is because the two-tailed test must allow for the possibility of either direction of the $\neq$ sign. (In other words, that the job offer rate for blacks could be higher or lower than for whites.)
## Your first regression (of this class)
Let's load a new data set and run an OLS regression. This data set contains year-by-year statistics about counties in the US. It has counts on number of various crimes committed, as well as demographic characteristics about the county.
```{r}
df <- as_tibble(countymurders)
```
A handy command to get a quick overview of an unfamiliar dataset is `glimpse()`:
```{r}
glimpse(df)
```
`glimpse()` tells you the number of observations, number of variables, and the name and type of each variable (e.g. integer, double).[^1]
### Regression syntax
To run a regression of $y$ on $x$ in R, use the following syntax:
```
est <- lm(y ~ x, data=data.name)
```
Here, `est` is an object where the regression coefficients (and other information about the model) is stored. `lm()` stands for "linear model" and is the function that you call to tell R to compute the OLS coefficients. `y` and `x` are variables names from whatever `tibble` you've stored your data in. The name of the `tibble` is `data.name`.
### Regress murder rate on execution rate
Using the `df` data set we created above, let's run a regression where `murders` is the dependent variable and `execs` is the independent variable:
```{r}
est <- lm(murders ~ execs, data=df)
```
To view the output of the regression in a friendly format, type
```{r results='show'}
tidy(est)
```
In the `estimate` column, we can see the estimated coefficients for $\beta_0$---`(Intercept)` in this case---and $\beta_1$ (`execs`). `est` also contains other information that we will use later in the course.
You can also look at the $R^2$ by typing
```{r results='show'}
glance(est)
```
Again, there's a lot of information here, but for now just focus on the $R^2$ term reported in the first column.
We can also view the output in a table format with `modelsummary()`:
```{r results='show'}
modelsummary(est)
```
<!--- # References --->
[^1]: "double" means "double precision floating point" and is a computer science-y way of expressing a real number (as opposed to an integer or a rational number).