lab16.Rmd

---
title: "In-Class Lab 16"
author: "ECON 4223 (Prof. Tyler Ransom, U of Oklahoma)"
date: "April 14, 2022"
output:
  pdf_document: default
  html_document:
    df_print: paged
  word_document: default
bibliography: biblio.bib
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, results = 'hide', fig.keep = 'none')
```

The purpose of this in-class lab is to use R to practice with difference-in-differences. To get credit, upload your .R script to the appropriate place on Canvas.

## For starters

Open up a new R script (named `ICL16_XYZ.R`, where `XYZ` are your initials) and add the usual "preamble" to the top:

```{r message=FALSE, warning=FALSE, paged.print=FALSE}
library(tidyverse)
library(wooldridge)
library(broom)
library(magrittr)
library(modelsummary)
library(estimatr)
```

### Load the data

Our data set will be a pooled cross section of workers in Kentucky and Michigan, as analyzed by @meyer_al1995. Load the data from the `wooldridge` package and restrict to those living in Kentucky:

```{r}
df <- as_tibble(injury)
df %<>% filter(ky==1)
```

### Policy setting

In 1980, Kentucky raised its cap on weekly earnings that were covered by worker's compensation.[^1] The outcome variable is `ldurat`, which is the log duration (in weeks) of worker's compensation benefits. The policy was such that the cap increase did not affect low-earnings workers, but did affect high-earnings workers. Thus, low-earnings workers serve as the control group, while high-earnings workers serve as the treatment group.

[^1]: Worker's compensation is "a form of insurance providing wage replacement and medical benefits to employees injured in the course of employment in exchange for mandatory relinquishment of the employee's right to sue their employer for ... negligence." ([Wikipedia](https://en.wikipedia.org/wiki/Workers%27_compensation))

We are interested to know if the policy caused workers to spend more time out of work. If benefits are not generous enough, then workers may sue the company for on-the-job injuries. On the other hand, benefits that are too generous may induce workers to be more reckless on the job, or to claim that off-the-job injuries were incurred while at work.

### Summary statistics

In difference-in-differences settings, it is often helpful to see if there even is a policy effect. Let's compute the difference-in-differences (with no regression) based on the pre- and post-means for the treatment and control groups

```{r, warning=F, message=F}
df %>% group_by(afchnge,highearn) %>% summarize(mean.ldurat = mean(ldurat))
```

1.  What is the difference in the differences? What is its interpretation? (Hint: recall that the dependent variable is logged)

## Difference-in-differences regression

The basic regression model to analyze the policy's impact is $$
\log(durat) = \beta_0 + \delta_0 afchnge + \beta_1 highearn + \delta_1 afchnge \cdot highearn + u
$$

Estimate this model, calling it `est.did`. I suppress the code here, since this should be old hat by now. (Hint: recall that, for two dummy variables `x1` and `x2`, putting `x1*x2` in the `lm()` formula will include in the formula each dummy and the interaction of the two.)

```{r include=FALSE, message=FALSE, warning=FALSE, paged.print=FALSE}
est.did <- lm_robust(ldurat ~ afchnge*highearn, data=df)
```

2.  Verify that your estimated $\hat{\delta}_1$ is the same as your answer in (1). Is the effect significant?

## Including more $x$'s

We might want to control for other aspects of our workers. For example, perhaps claims made by construction or manufacturing workers tend to have longer duration than claims made workers in other industries. Or maybe those claiming back injuries tend to have longer claims than those claiming head injuries. One may also want to control for worker demographics such as gender, marital status, and age.

Estimate an expanded version of the basic regression model, where the following additional variables are included:

-   `male`
-   `married`
-   Quadratic in `age`
-   `hosp` (1 = hospitalized)
-   `indust` (1 = manuf, 2 = construc, 3 = other)
-   `injtype` (1-8; categories for different types of injury)
-   `lprewage` (log of wage prior to filing a claim)

**Be sure to format `indust` and `injtyp` as factors before proceeding.** Call your model `est.did.x`.

```{r include=FALSE, message=FALSE, warning=FALSE, paged.print=FALSE}
df %<>% mutate(indust=as.factor(indust), injtype=as.factor(injtype))
est.did.x <- lm_robust(ldurat ~ afchnge*highearn + male + married + age + I(age^2) + injtype + lprewage + indust, data=df)
```

Calling `modelsummary` can help view the results:

```{r, results = 'show', warning=F}
modelsummary(list(est.did,est.did.x),output="latex")
```

3.  How different is your estimate of $\hat{\delta}_1$ from that in (2)?

4.  Do you see any interesting patterns among the $x$'s in predicting log duration of benefits?

# References