diff --git a/solutions2.Rmd b/solutions2.Rmd new file mode 100644 index 0000000..8c11b57 --- /dev/null +++ b/solutions2.Rmd @@ -0,0 +1,844 @@ +--- +title: "Solutions Parr" +author: "Christianna Parr" +date: "April 21, 2017" +output: + pdf_document: default + html_document: default +--- + +```{r setup, include=FALSE} +knitr::opts_chunk$set(echo = TRUE) +``` + +### This is the second assignment for POLS 503. The data and paper are by Yule, the original researcher. + +```{r} +library("dplyr") +library("tidyr") +library("tidyverse") +library("modelr") +library("broom") +library("ggplot2") +library ("car") +library ("stats") +library ("texreg") #makes tables +library ("bootstrap") +library ("sandwich") +``` + +```{r} +# devtools::install_github("jrnold/datums") +library(datums) +pauperism <- + left_join(datums::pauperism_plu, datums::pauperism_year, + by = "ID") %>% + mutate (year = as.character(year)) +``` + +### Run regressions of pauper using the yearly level data with the following specifications. + +```{r} + +M1 <- lm(paupratiodiff ~ outratiodiff + year + Type, data = pauperism) + +M2 <- lm(paupratiodiff ~ outratiodiff + (popratiodiff + oldratiodiff) * + (year + Type), data = pauperism) + +M3 <- lm(-1 + paupratiodiff ~ (outratiodiff + popratiodiff + oldratiodiff) * + (year + Type), data = pauperism) + +M4 <- lm(paupratiodiff ~ (outratiodiff + popratiodiff + oldratiodiff) * + (year + Type), data = pauperism) +``` + +### 1. Present the regressions results in a regression table + +```{r, error = TRUE, results = 'as is'} + +table1 <- screenreg(list(M1, M2, M3, M4)) + +table1 + +``` + +### 2. Interpret the coefficients for outratiodiff for each model. + +M1: Holding year and type of area constant, on average when there is an outratio difference of 1 unit, there is a 0.2343 increase in pauperism, ceteris parabis. As the proportion of people recieving outside aid increases, there is an increase in the level of pauperism. + +M2: Holding all else constant, on average when there is an outratio difference of 1 unit, there is a 0.23 increase in pauperism, ceteris parabis. In this model we include oldratiodiff in the controls to make sure we control for the amount of elderly persons recieving out relief payments. + +M3: Holding all else constant, on average when there is an outratio difference of 1 unit, there is a 0.53 increase in pauperism, ceteris parabis. In this regression we are transforming the variable by adding negative 1. We include interactions between outratiodiff and year, Type Mixed, Type Urban, and Type Rural. The coefficient for outratiodiff indicates the increase in pauperism for Metropolitan areas, and the covariance between outratiodiff and PLU type indicates the change on pauperism from Metropolitan areas and the other types of areas. + +M4: Holding all else constant, on average when there is an outratio difference of 1 unit, there is a 0.53 increase in pauperism, ceteris parabis. This model is the same as Model 3 but without the negative 1 transformation. + +### 3. Write the equations for each or all models, and describe the model with a sentence or two. Try to be as concise as possible. Look at recent journal articles for examples of the wording and format. + + +###Correction for writing equations incorrectly: +###M1: +$$\text{paupratiodiff}_{it} = \beta_{1}\text{outratiodiff}_{it} + \alpha_{type} + \rho_{t} + e_{it}$$ +###M2: +$$\text{paupratiodiff}_{it} = \beta_{1}\text{outratiodiff}_{it} + \beta_{2}\text{popratiodiff}_{it} + \beta_{3}\text{oldratiodiff}_{it}+{\bf\it{\Gamma}}\Big ((\text{popratiodiff}_{it} + \text{oldratiodiff}_{it}) \times (\alpha_{type} + \rho_{t})\Big) + \alpha_{type} + \rho_{t} + e_{it}$$ +###M3: +$$\text{paupratiodiff}_{it} = \beta_{1}\text{outratiodiff}_{it} + \beta_{2}\text{popratiodiff}_{it} + \beta_{3}\text{oldratiodiff}_{it} + {\bf\it{\Gamma}}\Big ((\text{outratiodiff}_{it} + \text{popratiodiff}_{it} + \text{oldratiodiff}_{it}) \times (\alpha_{type} + \rho_{t})\Big) + \alpha_{type} + \rho_{t} + e_{it}$$ + + +##Previous Answers: +M1: +$$ +paupratiodiff = \beta_0\ + \beta_1outratiodiff\ + \beta_2year\ + \beta_3type\ + \epsilon\ +$$ +In this first model we are regressing outratiodiff on paupratiodiff and controlling for year and Type. It is a simple linear model without interaction terms. + +M2: +$$ +paupratiodiff = \beta_0\ + \beta_1outratiodiff\ + \beta_2popratiodiff\ + \beta_3oldratiodiff\ + +\beta_4year\ + \beta_5Type\ + \\ \beta_6(popratiodiff \times\ year)\ + \beta_7(popratiodiff \times\ type)\ + \\ \beta_8(oldratiodiff \times\ year)\ + \beta_9(oldratiodiff \times\ type)\ + \epsilon\ +$$ +In this second model we are running another linear regression but we include interaction terms. We can measure the conditional changes when we include year and type with oldratiodiff, popratiodiff and outratiodiff. + +M3: +$$ +paupratiodiff - 1 = \beta_0\ + \beta_1outratiodiff\ + \beta_2popratiodiff\ + \beta_3oldratiodiff\ + +\beta_4year\ + \beta_5Type\ + \\ \beta_6(popratiodiff \times\ year)\ + \beta_7(popratiodiff \times\ type)\ + \\ \beta_8(oldratiodiff \times\ year)\ + \beta_9(oldratiodiff \times\ type)\ \\ +\beta_{10}(outratiodiff \times\ year)\ + \beta{11}(outratiodiff \times\ type)\ + \epsilon\ +$$ +In this third model we are including interaction terms for PLU, year, outratiodiff, popratiodiff and oldratiodiff. We are also including a negative 1 to transform the intercept but should not have any effect on the results. + +M4: +$$ +paupratiodiff = \beta_0\ + \beta_1outratiodiff\ + \beta_2popratiodiff\ + \beta_3oldratiodiff\ + +\beta_4year\ + \beta_5Type\ + \\ \beta_6(popratiodiff \times\ year)\ + \beta_7(popratiodiff \times\ type)\ + \\ \beta_8(oldratiodiff \times\ year)\ + \beta_9(oldratiodiff \times\ type)\ \\ +\beta_{10}(outratiodiff \times\ year)\ + \beta{11}(outratiodiff \times\ type)\ + \epsilon\ +$$ +In this fourth model we are running the same regression as model 3, but without the negative 1. + +### 4. What is the difference between M3 and M4. What are the pros and cons of each parameterization? + +Model 3 has a negative 1 to transform the parameters. The correlation of the data does not change, but it simply shifts the information down by 1 y unit. Maybe this would make more sense to subtract a 100 to see a substantive move. + +## 5. Conduct F-tests on the hypotheses: + +### a) All interactions in M4 are 0 + +```{r} +linearHypothesis(M4, c("popratiodiff:year1891", "popratiodiff:TypeUrban", "popratiodiff:TypeRural", "popratiodiff:TypeMixed", "oldratiodiff:year1891", "oldratiodiff:TypeMixed","oldratiodiff:TypeUrban", "oldratiodiff:TypeRural", "outratiodiff:year1891","outratiodiff:TypeUrban", "outratiodiff:TypeRural", "outratiodiff:TypeMixed")) + +``` + +### b) The coefficients on outratiodiff in M4 are the same across years +```{r} +linearHypothesis(M4, c("outratiodiff:year1891")) +``` + +### c) The coefficients on outratiodiff in M4 are the same across PLU Types +```{r} +linearHypothesis(M4, c("outratiodiff:TypeUrban", "outratiodiff:TypeRural", "outratiodiff:TypeMixed")) +``` + +### d) The coefficients on outratiodiff in M4 are the same across PLU Types and years. +```{r} +linearHypothesis(M4, c("outratiodiff:year1891","outratiodiff:TypeUrban", "outratiodiff:TypeRural", "outratiodiff:TypeMixed")) + +``` + +###6.Calculate the predicted value and confidence interval for the PLU with the median value of outratiodiff, popratiodiff, and oldratiodiff in each year and PLU Type for these models. Plot the predicted value and confidence interval of these as point-ranges. + +```{r} + +data1 <- pauperism %>% + group_by(year, Type) %>% + filter (!is.na(Type), year %in% c("1881", "1891")) %>% + summarise_at (vars(outratiodiff, popratiodiff, oldratiodiff), median, na.rm = TRUE) + +M1_new <- tidy(predict(M1, newdata = data1, interval = "confidence", level = 0.95)) + +dataM1 <- bind_cols (data1, M1_new) + +ggplot (data = dataM1) + + facet_grid(year ~ .) + + geom_pointrange(aes(x = Type, y = fit, ymin = lwr, ymax = upr)) + +M2_new <- tidy(predict(M2, newdata = data1, interval = "confidence", level = 0.95)) + +dataM2 <- bind_cols(data1, M2_new) + +ggplot (data = dataM2) + + facet_grid(year ~ .) + + geom_pointrange(aes(x = Type, y = fit, ymin = lwr, ymax = upr)) + +M3_new <- tidy(predict(M3, newdata = data1, interval = "confidence", level = 0.95)) + +dataM3 <- bind_cols(data1, M3_new) + +ggplot (data = dataM3) + + facet_grid(year ~ .) + + geom_pointrange(aes(x = Type, y = fit, ymin = lwr, ymax = upr)) + +M4_new <- tidy(predict(M4, newdata = data1, interval = "confidence", level = 0.95)) + +dataM4 <- bind_cols (data1, M4_new) + +ggplot (data = dataM4) + + facet_grid(year ~ .) + + geom_pointrange(aes(x = Type, y = fit, ymin = lwr, ymax = upr)) + +``` + +###7.As previously, calculate the predicted value of the median PLU in each year and PLU Type. But instead of confidence intervals include the prediction interval. How do the confidence and prediction intervals differ? What are their definitions? + +```{r} +M1_new <- tidy(predict(M1, newdata = data1, interval = "prediction", level = 0.95)) + +dataM1 <- bind_cols (data1, M1_new) + +ggplot (data = dataM1) + + facet_grid(year ~ .) + + geom_pointrange(aes(x = Type, y = fit, ymin = lwr, ymax = upr)) + +M2_new <- tidy(predict(M2, newdata = data1, interval = "prediction", level = 0.95)) + +dataM2 <- bind_cols(data1, M2_new) + +ggplot (data = dataM2) + + facet_grid(year ~ .) + + geom_pointrange(aes(x = Type, y = fit, ymin = lwr, ymax = upr)) + +M3_new <- tidy(predict(M3, newdata = data1, interval = "prediction", level = 0.95)) + +dataM3 <- bind_cols(data1, M3_new) + +ggplot (data = dataM3) + + facet_grid(year ~ .) + + geom_pointrange(aes(x = Type, y = fit, ymin = lwr, ymax = upr)) + +M4_new <- tidy(predict(M4, newdata = data1, interval = "prediction", level = 0.95)) + +dataM4 <- bind_cols (data1, M4_new) + +ggplot (data = dataM4) + + facet_grid(year ~ .) + + geom_pointrange(aes(x = Type, y = fit, ymin = lwr, ymax = upr)) + +``` + +A prediction interval is an interval of the predicted values, which extimates the true value of a y that has not been observed. It will contain the future values of a y, 95% of the time. A confidence interval contains the parameter estimate around the mean. It will contain the true population mean, 95% of the time. + +The difference between the confidence intervals and the prediction intervals: the prediction intervals are much wider. + +## Functional Forms + +The regression line of the model estimated in @Yule1899a (ignoring the year and region terms and interactions) can be also written as +$$ +\begin{aligned}[t] +100 \times \frac{\mathtt{pauper2}_t / \mathtt{Popn2_t}}{\mathtt{pauper2}_{t-1} / \mathtt{Popn2_{t-1}}} +&= \beta_0 + \beta_1 \times 100 \times \frac{\mathtt{outratio}_t}{\mathtt{outratio_{t-1}}} \\ +& \quad + \beta_2 \times 100 \times \frac{\mathtt{Popn65}_t / \mathtt{Popn2}_{t}}{\mathtt{Popn65}_{t-1} / \mathtt{Popn2}_{t-1}} + \beta_3 \times 100 \times \frac{\mathtt{Popn2}_t}{\mathtt{Popn2}_{t - 1}} +\end{aligned} +$$ + +1. Take the logarithm of each side, and simplify so that $\log(\mathtt{pauper2}_t/\mathtt{pauper2}_{t -1})$ is the outcome and the predictors are all in the form $\log(x_t) - \log(x_{t - 1}) = \log(x_t / x_{t - 1})$ +$$ +\log (\mathtt{pauper2}_t/\mathtt{pauper2}_{t-1}) = \beta_0\ + \beta_1(\log(\mathtt{outratiodiff}))\\ - +\log(\mathtt{outratiodiff_{t-1}})\ + \beta_2(\log(\mathtt{popn65_t}))\\ - \log(\mathtt{popn65_{t-1}})\ + +\log(\mathtt{popn2_t})\ - \log(\mathtt{pop2_{t-1}})\ + \beta_3(\log\mathtt{popn2_t})\ - \log(\mathtt{popn2_{t-1}}) + +$$ + +2. Estimate the model with logged difference predictors, Year, and month and interpret the coefficient on $\log(outratio_t)$ . + +```{r} + +tidy(lm (log(pauper2) ~ (log (outratio) + log(Popn2) + log(Prop65)) * year + Type, data = pauperism)) + +``` + +In this logged model we can see that on average, ceteris paribus, a 1% change in the outratiodiff will have a 0.21% change on pauperism. In order to find $\log(outratio_t)$ we need to look at the covariance between outratio and year. Here we find a 6% increase on the effect of outratio per year, therefore a 0.21 to 0.27 increase. + +3. What are the pros and cons of this parameterization of the model relative to the one in @Yule1899a? Focus on interpretation and the desired goal of the inference rather than the formal tests of the regression. Can you think of other, better functional forms? + +When we log variables they are better to interpret since they are now based on percentage change rather than ratio differences. Over time this becomes easier to interpret as well as less labor intensive. Especially for a project such as Yule's where seeing the percentage changes over time would make more sense than ratio differences. Yule could see the percent change from 1871 to 1881, 1881 to 1891. Another functional form one could use is a weighted regression. We can weight each statistic by the population to control for size. + +## Non-differenced Model + +Suppose you estimate the model (*M5*) without differencing, +```{r} +M5 <- tidy(lm(pauper2 ~ outratio + (Popn2 + Prop65) * (year + Type), data = pauperism)) +tidy(M5) +tidy(M2) +``` + +- Interpret the coefficient on `outratio`. How is this different than model *M2*. + +In the nondifferenced model the coefficient is much smaller (0.0012) than model 2 which is 0.23. The non differenced model has no baseline value unlike outratiodiff (which can use the initial year, 1871, as a comparison point). Therefore, we are unable to see the true effect of outratio on pauperism with a non differenced model since we are comparing outratio and paupratio directly, without seeing the changes over time from 1871 to 1881, 1881 to 1891. Model 5 is a difference-in-differences model while model 2 is a before-and-after design. + +- What accounts for the different in sample sizes in *M5* and *M2*? + +When the difference is not taken into account, you can use the data from 1871. This explains why M5 is has a bigger sample size since the year 1871 is taken into account. + +- What model do you think will generally have less biased estimates of the effect of out-relief on pauperism: *M5* or *M2*? Explain your reasoning. + +M2 will have less biased estimates because of the comparing of different terms. This means changes inside a muncipality are measured with regards to pauperism. Instead of looking at out relief pauperism by itself. With the differing you are comparing from one year to another, you have a baseline each time to compare to. + +## Substantive Effects + +Read @Gross2014a and @McCaskeyRainey2015a. Use the methods described in those papers to assess the substantive effects of out-ratio on the rate of pauperism. Use the model(s) of your choosing. + +McCaskey and Rainy (2015): Suggestions for Substantive Researchers +For the researcher making claims of substantive significance, we suggest the following +strategy: +1. Compute 90% confidence intervals around the estimated effects. +2. Interpret each endpoint of the interval. +3. Claim that the effect is substantively meaningful if and only if all effects in +the confidence interval are substantively meaningful. + +```{r} +# understanding the magnitude of the effect + +tstar <- 1.645 + +M4_ci <- tidy(M4) %>% + mutate (upr = (estimate + (std.error * tstar)), lwr = (estimate - (std.error * tstar))) + +tidy(M4_ci) + +``` + +Now that we have found the 90% confidence intervals for the chosen model (M4). We can see that outratiodiff has an influence between 0.43 and 0.63. Therefore, holding all else constant, on average each one unit increase on outratiodiff has an influence in the range of a 0.43 and 0.63 increase on pauperism. This effect is substantively meaningful due to the potential 43% or 63% growth in pauperism. + +## Influential Observations and Outliers + +### Influential Observations for the Regression + +For this use *M2*: + +1. For each observation, calculate and explain the following: + + - hat value (`hatvalues`) + - standardized error (`rstandard`) + - studentized error (`rstudent`) + - Cook's distance (`cooksd`) + +```{r} + +hatvalues <- hatvalues(M2) # Gives hat scores, where res is the result from lm() +hatscore <- hatvalues(M2)/mean(hatvalues(M2)) # Gives standardized hat scores +head (hatscore) + +# standardized error (rstandard) +rsta <- rstandard(M2) #Gives standardized residuals +head (rsta) + +# studentized error (rstudent) +rstu <- rstudent(M2) #Gives studentized residuals +head (rstu) + +# Cook’s distance (cooksd) +cooks <- cooks.distance(M2) #Gives Cook's Distance +head (cooks) + +# alternative: + +M2_aug <- augment(M2) %>% + mutate (.student.resid = .resid / .sigma * sqrt (1 - .hat)) + +glimpse (M2_aug) + +``` + +2. Create an outlier plot and label any outliers. See the example [here](https://jrnold.github.io/intro-methods-notes/outliers.html#iver-and-soskice-data) + +```{r} +plot(hatscore,rstu, xlab="Standardized hat-values", ylab="Studentized Residuals", +main="Influence Plot") + +abline(h=c(-2,2), lty=2) + +abline(v=c(2,3), lty=c(2,3)) + +``` +3. Using the plot and rules of thumb identify outliers and influential observations + +Looking at the plot we can say that any values which are outside of the box drawn from -2 to 2 are outliers. They are both high on leverage and discrepency. The standardized hat values which are more than 2 or 3 (past one or both the vertical lines) would also be outliers. + +## Influential Observations for a Coefficient + +1. Run *M2*, deleting each observation and saving the coefficient for `outratiodirff`. This is a method called the jackknife. You can use a for loop to do this, or you can use the function `jackknife` in the package [resamplr](https://github.com/jrnold/resamplr). + +```{r} + +jackknifeCOEF <- matrix(NA, ncol= 16, nrow= nrow(pauperism)) + +tidy_M2 <- tidy (M2) + +colnames(jackknifeCOEF) <- c (tidy_M2$term) + +for (i in 1:nrow(pauperism)){ +jackknifeCOEF[i,] <- coef((lm(paupratiodiff ~ outratiodiff + (popratiodiff + oldratiodiff) * + (year + Type), data = pauperism[-i,]))) +} + +head(jackknifeCOEF) + +``` + +For which observations is there the largest change in the coefficient on `outratiodiff`? + +1. Which observations have the largest effect on the estimate of `outratiodiff`? +```{r} +summary (jackknifeCOEF) +jackknife1 <- as.data.frame(jackknifeCOEF) +jackknife1$.rownames <- as.character(seq.int(nrow(jackknife1))) +jack <- jackknife1 %>% + select (.rownames, outratiodiff) %>% + filter (outratiodiff > 0.235) %>% + print() + +``` + +After filtering out the observations which are more than 0.235, we can see there are 8 observations that have a large effect on outratiodiff. + +2. How do these observations compare with those that had the largest effect on the overall regression as measured with Cook's distance? + +```{r} + +M2_aug %>% + filter (.cooksd > 0.35) + +``` + +The observation from row 1413 shows up in both the jackknife and cooks distance calculations. However, only 1410 is visible in the cooks distance calculation. + +3. Compare the results of the jackknife to the `dfbeta` statistic for `outratiodiff` + +```{r} +dfbeta <- tidy(dfbetas (M2)) + +dfbeta %>% + filter (abs(outratiodiff) > 2/sqrt (nrow(dfbeta))) %>% + inner_join (jack, by = ".rownames") %>% + print + +``` + +When we compare the dfbeta statistics to the jackknife we see that the exact same rows appear from both. They correspond to each other. + +2.@AronowSamii2015a note that the influence of observations in a regression coefficient is different than the the influence of regression observations in the entire regression. Calculate the observation weights for `outratiodiff`. + +1. Regress `outratiodiff` on the control variables. + +```{r} +M2_out <- lm (outratiodiff ~ (popratiodiff + oldratiodiff) * (year + Type), data = pauperism) + +M2_out + +``` + +2. The weights of the observations are those with the highest squared errors from this regression. Which observations have the highest coefficient values? + +```{r} +summary(resid (M2_out)) +M2_resid <- tidy (resid(M2_out)) %>% + mutate (x = abs(x)) %>% + top_n (x, n = 25) + +head(M2_resid) # printing only 6 rows but there are 25 values. +``` + +With the resid function we can see the top 25 rows with the absolute highest values from the M2_out regression. + +3. How do the observations with the highest regression weights compare with those with the highest changes in the regression coefficient from the jackknife? + +```{r} +M2_resid %>% + rename (.rownames = names) %>% + inner_join (jack, by = ".rownames") + +``` + + + +## Omitted Variable Bias + +An informal way to assess the potential impact of omitted variables on the coeficient of the variable of interest is to coefficient variation when covariates are added as a measure of the potential for omitted variable bias [@Oster2016a]. + +@NunnWantchekon2011a (Table 4) calculate a simple statistic for omitted variable bias in OLS. This statistic "provide[s] a measure to gauge the strength of the likely bias arising from unobservables: how much stronger selection on unobservables, relative to selection on observables, must be to explain away the full estimated effect." + +1. Run a regression without any controls. Denote the coefficient on the variable of interest as $\hat\beta_R$. + +```{r} +beta_hat_R <- tidy (lm(paupratiodiff ~ outratiodiff, data = pauperism)) %>% + filter (term == "outratiodiff") %>% + select (estimate) + +beta_hat_R <- as.numeric(beta_hat_R) +beta_hat_R + +``` + +2. Run a regression with the full set of controls. Denote the coefficient on the variable of interest in this regression as $\hat\beta_F$. + +```{r} +beta_hat_F <- tidy_M2 %>% + filter (term == "outratiodiff") %>% + select (estimate) + +beta_hat_F <- as.numeric (beta_hat_F) +beta_hat_F + +``` + +3. The ratio is $$\hat\beta_F / (\hat\beta_R - \hat\beta_F)$$ + +```{r} +beta_hat = beta_hat_F / (beta_hat_R - beta_hat_F) +beta_hat + +``` + +Calculate this statistic for *M2* and interpret it. + +The ratio is 3.157. According to the Nunn and Wantchekon 2011 article, the selection on unobservables must be more than 3 times stronger than the selection on observables to explain away the entire effect (Pg. 3238). + +## Heteroskedasticity + +1. Run *M2* and *M3* with a heteroskedasticity consistent (HAC), also called robust, standard error. How does this affect the standard errors on `outratio` coefficients? Use the **sandwich** package to add HAC standard errors [@Zeileis2004a].Compare SE to the HAC. + +```{r} + +HAC_M2 <- vcovHAC(M2) #Gives the estimated HAC covariance matrix +diag_M2 <- sqrt(diag(vcovHAC(M2))) #Gives the HAC standard errors + +diag_M2[2] - tidy_M2$std.error[2] + +HAC_M3 <- vcovHAC(M3) +diag_M3 <- sqrt(diag(vcovHAC(M3))) +tidy_M3 <- tidy (M3) + +diag_M3[2] - tidy_M3$std.error[2] + +``` + +Using the HAC calculation we can see an increase in the standard error from the original M2 regression by around 0.005. + +Using the HAC calculation we can see an increase in the standard error from the original M3 regression by around 0.0132. + +### Multiple Regressions + +1. Run the model with interactions for all years and types +```{r} +tidy(lm(pauper2 ~ (outratio + Popn2 + Prop65) * year * Type - 1, data = pauperism)) +``` + + +2. For each subset of year and type run the regression +```{r} +lm(pauper2 ~ outratio + Popn2 + Prop65, data = pauperism) +``` + +3. Compare the coefficients, standard errors, and regression standard errors in these regresions. + +```{r} +all_interact <- + crossing(Type = pauperism$Type, year = c(1881, 1891)) %>% + mutate(mod = map2(year, Type, + function(yr, ty) { + lm(paupratiodiff ~ outratiodiff + popratiodiff + oldratiodiff, + data = filter(pauperism, + year == yr, + Type == ty)) + })) %>% + mutate(mod_glance = map(mod, broom::glance), + mod_tidy = map(mod, broom::tidy)) + +all_interact %>% + mutate (sigma = map_dbl(mod_glance, function (x) x$sigma)) %>% + mutate (std.error.out = map_dbl(mod_tidy, function (x) x$std.error[2])) %>% + mutate (estimate.out = map_dbl(mod_tidy, function (x) x$estimate[2])) %>% + select (year, Type, sigma, std.error.out, estimate.out) + +``` + +## Weighted Regression + +1. Run *M2* and *M3* as weighted regressions, weighted by the population (`Popn`) and interpret the coefficients on `outratiodiff` and interactions. Informally assess the extent to which the coefficients are different. Which one does it seem to affect more? +```{r} + +M2_weights <- lm(paupratiodiff ~ outratiodiff + (popratiodiff + oldratiodiff) * + (year + Type), weights = Popn, data = pauperism) + +head(tidy(M2_weights)) + +M3_weights <- lm(-1 + paupratiodiff ~ (outratiodiff + popratiodiff + oldratiodiff) * + (year + Type), weights = Popn, data = pauperism) + +head(tidy(M3_weights)) + +``` + +Holding all else constant and with weights, on average in M2 a one unit change in outratiodiff has a 0.36 increase in paupratiodiff. Holding all else constant and with weights, on average in M3 a one unit change in outratiodiff has a 0.72 increase in paupratiodiff. The difference between the two is quite stark. Both M2 and M3 increase when compared to their original coefficients, without weights. However, between the two M3 is effected more (from 0.53 to 0.72). + +2. What are some rationales for weighting by population? See the discussion in @SolonHaiderWooldridge2013a and @AngristPischke2014a. + +According to "Mastering Metrics" weighting weights each term in the residual sum of squares by population or some other weight. We are able to take into account the population of people in each district and make sure the larger districts count for more. In this way we are able to account for small districts having more variability on average. + + +## Cross-Validation + +When using regression for causal inference, model specification and choice should largely be based on avoiding omitted variables. +Another criteria for selecting models is to use their fit to the data. +But a model's fit to data should not be assessed using only the in-sample data. +That leads to overfitting---and the best model would always be to include an indicator variable for every observation +Instead, a model's fit to data can be assessed by using its out-of-sample fit. +One way to estimate the *expected* fit of a model to *new* data is cross-validation. + +We want to compare the predictive performance of the following models +```{r} +mod_formulas <- + list( + m0 = paupratiodiff ~ 1, + m1 = paupratiodiff ~ year + Type, + m2 = paupratiodiff ~ outratiodiff + year + Type, + m3 = paupratiodiff ~ outratiodiff + (popratiodiff + oldratiodiff) * (year + Type), + m4 = -1 + paupratiodiff ~ (outratiodiff + popratiodiff + oldratiodiff) * (year + Type), + m5 = paupratiodiff ~ (outratiodiff + popratiodiff + oldratiodiff) * year * Type + ) +``` + +Let's split the data into 10 (train/test) folds for cross-validation, +```{r} +pauperism_nonmiss <- + pauperism %>% + filter(year %in% c(1881, 1891)) %>% + select(paupratiodiff, outratiodiff, popratiodiff, oldratiodiff, year, Type, Region, ID, BoothGroup) %>% + tidyr::drop_na() +pauperism_10folds <- + pauperism_nonmiss %>% + resamplr::crossv_kfold(10) +``` + + +For each model formula `f`, training data set `train`, and test data set, `test`, +run the model specified by `f` on `train`, and predict new observations in `test`, and calculate the RMSE from the residuals +```{r} +mod_rmse_fold <- function(f, train, test) { + fit <- lm(f, data = as.data.frame(train)) + test_data <- as.data.frame(test) + err <- test_data$paupratiodiff - predict(fit, newdata = test_data) + sqrt(mean(err ^ 2)) +} +``` +E.g. for one fold and formula, +```{r} +mod_rmse_fold(mod_formulas[[1]], pauperism_10folds$train[[1]], + pauperism_10folds$test[[1]]) +``` + +Now write a function that will calculate the average RMSE across folds for a formula and a cross-validation data frame with `train` and `test` list-columns: +```{r} +mod_rmse <- function(f, data) { + map2_dbl(data$train, data$test, + function(train, test) { + mod_rmse_fold(f, train, test) + }) %>% + mean() +} +``` +```{r} +mod_rmse(mod_formulas[[1]], pauperism_10folds) +``` + +Finally, we want to run `mod_rmse` for each formula in `mod_formulas`. +It will be easiest to store this in a data frame: +```{r} +cv_results <- tibble( + model_formula = mod_formulas, + .id = names(mod_formulas), + # Formula as a string + .name = map(model_formula, + function(x) gsub(" +", " ", paste0(deparse(x), collapse = ""))) +) +``` +Use `map` to run `mod_rmse` for each model and save it as a list frame in +the data frame, +```{r} +cv_results <- + mutate(cv_results, + cv10_rmse = map(model_formula, mod_rmse, data = pauperism_10folds)) +``` + +In the case of linear regression, the MSE of the Leave-one-out ($n$-fold) cross-validation can be analytically calculated without having to run $n$ regressions. +```{r} +loocv <- function(x) { + mean((residuals(x) / (1 - hatvalues(x))) ^ 2) +} +``` + +```{r} +cv_results <- + mutate(cv_results, + rmse_loo = map(mod_formulas, function(f) sqrt(loocv(lm(f, data = pauperism_nonmiss))))) +``` + + +1. In the 10-fold cross validation, which model has the best out of sample prediction? +```{r} + +for (i in seq_len(nrow(cv_results))) { + print (cv_results$cv10_rmse[i]) +} + +``` + +From this analysis we can see that Model 4 has the best out of sample prediction. + +2. Using the LOO-CV cross-validation, which model has the best out of sample prediction? + +```{r} +for (i in seq_len(nrow(cv_results))) { + print (cv_results$rmse_loo[i]) +} +``` + +Once again, Model 4 has the best out of sample prediction. + +3. Does the prediction metric (RMSE) and prediction task---predicting individual PLUs from other PLUs---make sense? Can you think of others that you would prefer? + +The task makes sense. The individual PLUs on average should have similar standard errors. However, as seen in the weighted regression section, smaller PLUs can have more variation, so we should make sure we account for this variation in the PLU size and effect. + +## Bootstrapping + +Estimate the 95% confidence intervals of model with simple non-parametric bootstrapped standard errors. The non-parametric bootstrap works as follows: + +Let $\hat\theta$ be the estimate of a statistic. To calculate bootstrapped standard errors and confidence intervals use the following procedure. + +For samples $b = 1, ..., B$. + +1. Draw a sample with replacement from the data +2. Estimate the statistic of interest and call it $\theta_b^*$. + +Let $\theta^* = \{\theta_1^*, \dots, \theta_B^*\}$ be the set of bootstrapped statistics. + +- standard error: $\hat\theta$ is $\sd(\theta^*)$. +- confidence interval: + + - normal approximation. This calculates the confidence interval as usual but uses the bootstrapped standard error instead of the classical OLS standard error: $\hat\theta \pm t_{\alpha/2,df} \cdot \sd(\theta^*)$ + - quantiles: A 95% confidence interval uses the 2.5% and 97.5% quantiles of $\theta^*$ for its upper and lower bounds. + + +Original model +```{r} +mod_formula <- paupratiodiff ~ outratiodiff + (popratiodiff + oldratiodiff) * year * Type +mod_orig <- lm(mod_formula, data = pauperism_nonmiss) +``` + +```{r} +bs_coef_se <- + resamplr::bootstrap(pauperism_nonmiss, 1024) %>% + # extract the strap column + `[[`("sample") %>% + # run + map_df(function(dat) { + lm(mod_formula, data = dat) %>% + broom::tidy() %>% + select(term, estimate) + }) %>% + # calculate 2.5%, 97.5% and sd of estimates + group_by(term) %>% + summarise( + std.error_bs = sd(estimate), + conf.low_bsq = quantile(estimate, 0.025), + conf.low_bsq = quantile(estimate, 0.975) + ) +``` + +Now compare the std.error of the original and the bootstrap for `outratiodiff` +```{r} +broom::tidy(mod_orig, conf.int = TRUE) %>% + select(term, estimate, std.error) %>% + filter(term == "outratiodiff") %>% + left_join(bs_coef_se, by = "term") +``` +The bootstrap standard error is slightly higher. +It is similar to the standard error generated using the heteroskedasticity consistent standard error. +```{r} +sqrt(sandwich::vcovHC(mod_orig)["outratiodiff", "outratiodiff"]) +``` + +It is likely that there is correlation between the error terms of observations. +At the very least, each PLU is included twice; these observations are likely +correlated, so we are effectively overstating the sample size of our data. +One way to account for that is to resample "PLUs", not PLU-years. +This cluster-bootstrap will resample each PLU (and all its observations), rather than resampling the observations themselves. +```{r} +pauperism_nonmiss %>% + group_by(ID) %>% + resamplr::bootstrap(1024) %>% + # extract the strap column + `[[`("sample") %>% + # run + map_df(function(dat) { + lm(mod_formula, data = dat) %>% + broom::tidy() %>% + select(term, estimate) + }) %>% + # calculate 2.5%, 97.5% and sd of estimates + group_by(term) %>% + summarise( + std.error_bs = sd(estimate), + conf.low_bsq = quantile(estimate, 0.025), + conf.low_bsq = quantile(estimate, 0.975) + ) %>% + filter(term == "outratiodiff") +``` +However, this yields a standard error not much different than the Robust standard error. + +1. Try bootstrapping "Region" and "BoothGroup". Do either of these make much difference in the standard errors. + +```{r} + +# Region: + +pauperism_nonmiss %>% + group_by(Region) %>% + resamplr::bootstrap(1024) %>% + # extract the strap column + `[[`("sample") %>% + # run + map_df(function(dat) { + lm(mod_formula, data = dat) %>% + broom::tidy() %>% + select(term, estimate) + }) %>% + # calculate 2.5%, 97.5% and sd of estimates + group_by(term) %>% + summarise( + std.error_bs = sd(estimate), + conf.low_bsq = quantile(estimate, 0.025), + conf.low_bsq = quantile(estimate, 0.975) + ) %>% + filter(term == "outratiodiff") + +# BoothGroup + +pauperism_nonmiss %>% + group_by(BoothGroup) %>% + resamplr::bootstrap(1024) %>% + # extract the strap column + `[[`("sample") %>% + # run + map_df(function(dat) { + lm(mod_formula, data = dat) %>% + broom::tidy() %>% + select(term, estimate) + }) %>% + # calculate 2.5%, 97.5% and sd of estimates + group_by(term) %>% + summarise( + std.error_bs = sd(estimate), + conf.low_bsq = quantile(estimate, 0.025), + conf.low_bsq = quantile(estimate, 0.975) + ) %>% + filter(term == "outratiodiff") +``` + +When we run these two calculations we can see that neither of these makes a significantly large difference in the standard errors for Region or BoothGroup, compared to the original bootstrap. diff --git a/solutions2.html b/solutions2.html new file mode 100644 index 0000000..b59b7b9 --- /dev/null +++ b/solutions2.html @@ -0,0 +1,1405 @@ + + + + + + + + + + + + + + + +Solutions Parr + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+ + + + + + + + + + + + + + +
+

This is the second assignment for POLS 503. The data and paper are by Yule, the original researcher.

+
library("dplyr")
+
## 
+## Attaching package: 'dplyr'
+
## The following objects are masked from 'package:stats':
+## 
+##     filter, lag
+
## The following objects are masked from 'package:base':
+## 
+##     intersect, setdiff, setequal, union
+
library("tidyr")
+library("tidyverse")
+
## Loading tidyverse: ggplot2
+## Loading tidyverse: tibble
+## Loading tidyverse: readr
+## Loading tidyverse: purrr
+
## Conflicts with tidy packages ----------------------------------------------
+
## filter(): dplyr, stats
+## lag():    dplyr, stats
+
library("modelr")
+library("broom")
+
## 
+## Attaching package: 'broom'
+
## The following object is masked from 'package:modelr':
+## 
+##     bootstrap
+
library("ggplot2")
+library ("car")
+
## Warning: package 'car' was built under R version 3.3.3
+
## 
+## Attaching package: 'car'
+
## The following object is masked from 'package:purrr':
+## 
+##     some
+
## The following object is masked from 'package:dplyr':
+## 
+##     recode
+
library ("stats")
+library ("texreg") #makes tables
+
## Warning: package 'texreg' was built under R version 3.3.3
+
## Version:  1.36.23
+## Date:     2017-03-03
+## Author:   Philip Leifeld (University of Glasgow)
+## 
+## Please cite the JSS article in your publications -- see citation("texreg").
+
## 
+## Attaching package: 'texreg'
+
## The following object is masked from 'package:tidyr':
+## 
+##     extract
+
library ("bootstrap")
+
## 
+## Attaching package: 'bootstrap'
+
## The following object is masked from 'package:broom':
+## 
+##     bootstrap
+
## The following object is masked from 'package:modelr':
+## 
+##     bootstrap
+
library ("sandwich")
+
## Warning: package 'sandwich' was built under R version 3.3.3
+
# devtools::install_github("jrnold/datums")
+library(datums)
+pauperism <-
+  left_join(datums::pauperism_plu, datums::pauperism_year,
+            by = "ID") %>%
+  mutate (year = as.character(year))
+
+
+

Run regressions of pauper using the yearly level data with the following specifications.

+
M1 <- lm(paupratiodiff ~ outratiodiff + year + Type, data = pauperism)
+
+M2 <- lm(paupratiodiff ~ outratiodiff + (popratiodiff + oldratiodiff) * 
+           (year + Type), data = pauperism)
+
+M3 <- lm(-1 + paupratiodiff ~ (outratiodiff + popratiodiff + oldratiodiff) * 
+           (year + Type), data = pauperism)
+
+M4 <- lm(paupratiodiff ~ (outratiodiff + popratiodiff + oldratiodiff) * 
+           (year + Type), data = pauperism)
+
+
+

1. Present the regressions results in a regression table

+
table1 <- screenreg(list(M1, M2, M3, M4))
+
+table1
+
## 
+## ==========================================================================
+##                         Model 1      Model 2      Model 3      Model 4    
+## --------------------------------------------------------------------------
+## (Intercept)               56.71 ***   -55.34         6.17         7.17    
+##                           (2.52)      (35.37)      (37.86)      (37.86)   
+## outratiodiff               0.23 ***     0.23 ***     0.53 ***     0.53 ***
+##                           (0.02)       (0.01)       (0.06)       (0.06)   
+## year1891                  14.70 ***    15.81        28.05        28.05    
+##                           (1.28)      (16.48)      (16.48)      (16.48)   
+## TypeMixed                 -6.45 *      64.41        -5.58        -5.58    
+##                           (2.61)      (37.22)      (39.90)      (39.90)   
+## TypeRural                 -7.32 **     31.75       -39.61       -39.61    
+##                           (2.61)      (41.72)      (44.04)      (44.04)   
+## TypeUrban                 -3.02       109.08 **     45.07        45.07    
+##                           (2.77)      (37.55)      (40.24)      (40.24)   
+## popratiodiff                           -0.15        -0.19 *      -0.19 *  
+##                                        (0.09)       (0.09)       (0.09)   
+## oldratiodiff                            1.27 ***     0.55         0.55    
+##                                        (0.29)       (0.33)       (0.33)   
+## popratiodiff:year1891                  -0.24 ***    -0.26 ***    -0.26 ***
+##                                        (0.06)       (0.06)       (0.06)   
+## popratiodiff:TypeMixed                  0.25 *       0.31 **      0.31 ** 
+##                                        (0.11)       (0.11)       (0.11)   
+## popratiodiff:TypeRural                  0.43 *       0.50 **      0.50 ** 
+##                                        (0.17)       (0.17)       (0.17)   
+## popratiodiff:TypeUrban                  0.05         0.09         0.09    
+##                                        (0.10)       (0.10)       (0.10)   
+## oldratiodiff:year1891                   0.20         0.18         0.18    
+##                                        (0.13)       (0.13)       (0.13)   
+## oldratiodiff:TypeMixed                 -0.96 **     -0.20        -0.20    
+##                                        (0.30)       (0.34)       (0.34)   
+## oldratiodiff:TypeRural                 -0.84 **     -0.09        -0.09    
+##                                        (0.31)       (0.35)       (0.35)   
+## oldratiodiff:TypeUrban                 -1.12 ***    -0.50        -0.50    
+##                                        (0.32)       (0.36)       (0.36)   
+## outratiodiff:year1891                               -0.09 **     -0.09 ** 
+##                                                     (0.03)       (0.03)   
+## outratiodiff:TypeMixed                              -0.27 ***    -0.27 ***
+##                                                     (0.06)       (0.06)   
+## outratiodiff:TypeRural                              -0.25 ***    -0.25 ***
+##                                                     (0.06)       (0.06)   
+## outratiodiff:TypeUrban                              -0.14 *      -0.14 *  
+##                                                     (0.07)       (0.07)   
+## --------------------------------------------------------------------------
+## R^2                        0.38         0.45         0.46         0.46    
+## Adj. R^2                   0.38         0.44         0.46         0.46    
+## Num. obs.               1180         1180         1180         1180       
+## RMSE                      19.16        18.18        17.93        17.93    
+## ==========================================================================
+## *** p < 0.001, ** p < 0.01, * p < 0.05
+
+
+

2. Interpret the coefficients for outratiodiff for each model.

+

M1: Holding year and type of area constant, on average when there is an outratio difference of 1 unit, there is a 0.2343 increase in pauperism, ceteris parabis. As the proportion of people recieving outside aid increases, there is an increase in the level of pauperism.

+

M2: Holding all else constant, on average when there is an outratio difference of 1 unit, there is a 0.23 increase in pauperism, ceteris parabis. In this model we include oldratiodiff in the controls to make sure we control for the amount of elderly persons recieving out relief payments.

+

M3: Holding all else constant, on average when there is an outratio difference of 1 unit, there is a 0.53 increase in pauperism, ceteris parabis. In this regression we are transforming the variable by adding negative 1. We include interactions between outratiodiff and year, Type Mixed, Type Urban, and Type Rural. The coefficient for outratiodiff indicates the increase in pauperism for Metropolitan areas, and the covariance between outratiodiff and PLU type indicates the change on pauperism from Metropolitan areas and the other types of areas.

+

M4: Holding all else constant, on average when there is an outratio difference of 1 unit, there is a 0.53 increase in pauperism, ceteris parabis. This model is the same as Model 3 but without the negative 1 transformation.

+
+
+

3. Write the equations for each or all models, and describe the model with a sentence or two. Try to be as concise as possible. Look at recent journal articles for examples of the wording and format.

+

M1: \[ +paupratiodiff = \beta_0\ + \beta_1outratiodiff\ + \beta_2year\ + \beta_3type\ + \epsilon\ +\] In this first model we are regressing outratiodiff on paupratiodiff and controlling for year and Type. It is a simple linear model without interaction terms.

+

M2: \[ +paupratiodiff = \beta_0\ + \beta_1outratiodiff\ + \beta_2popratiodiff\ + \beta_3oldratiodiff\ + +\beta_4year\ + \beta_5Type\ + \\ \beta_6(popratiodiff \times\ year)\ + \beta_7(popratiodiff \times\ type)\ + \\ \beta_8(oldratiodiff \times\ year)\ + \beta_9(oldratiodiff \times\ type)\ + \epsilon\ +\] In this second model we are running another linear regression but we include interaction terms. We can measure the conditional changes when we include year and type with oldratiodiff, popratiodiff and outratiodiff.

+

M3: \[ +paupratiodiff - 1 = \beta_0\ + \beta_1outratiodiff\ + \beta_2popratiodiff\ + \beta_3oldratiodiff\ + +\beta_4year\ + \beta_5Type\ + \\ \beta_6(popratiodiff \times\ year)\ + \beta_7(popratiodiff \times\ type)\ + \\ \beta_8(oldratiodiff \times\ year)\ + \beta_9(oldratiodiff \times\ type)\ \\ +\beta_{10}(outratiodiff \times\ year)\ + \beta{11}(outratiodiff \times\ type)\ + \epsilon\ +\] In this third model we are including interaction terms for PLU, year, outratiodiff, popratiodiff and oldratiodiff. We are also including a negative 1 to transform the intercept but should not have any effect on the results.

+

M4: \[ +paupratiodiff = \beta_0\ + \beta_1outratiodiff\ + \beta_2popratiodiff\ + \beta_3oldratiodiff\ + +\beta_4year\ + \beta_5Type\ + \\ \beta_6(popratiodiff \times\ year)\ + \beta_7(popratiodiff \times\ type)\ + \\ \beta_8(oldratiodiff \times\ year)\ + \beta_9(oldratiodiff \times\ type)\ \\ +\beta_{10}(outratiodiff \times\ year)\ + \beta{11}(outratiodiff \times\ type)\ + \epsilon\ +\] In this fourth model we are running the same regression as model 3, but without the negative 1.

+
+
+

4. What is the difference between M3 and M4. What are the pros and cons of each parameterization?

+

Model 3 has a negative 1 to transform the parameters. The correlation of the data does not change, but it simply shifts the information down by 1 y unit. Maybe this would make more sense to subtract a 100 to see a substantive move.

+
+
+

5. Conduct F-tests on the hypotheses:

+
+

a) All interactions in M4 are 0

+
linearHypothesis(M4, c("popratiodiff:year1891", "popratiodiff:TypeUrban", "popratiodiff:TypeRural", "popratiodiff:TypeMixed", "oldratiodiff:year1891", "oldratiodiff:TypeMixed","oldratiodiff:TypeUrban", "oldratiodiff:TypeRural", "outratiodiff:year1891","outratiodiff:TypeUrban", "outratiodiff:TypeRural", "outratiodiff:TypeMixed"))  
+
## Linear hypothesis test
+## 
+## Hypothesis:
+## popratiodiff:year1891 = 0
+## popratiodiff:TypeUrban = 0
+## popratiodiff:TypeRural = 0
+## popratiodiff:TypeMixed = 0
+## oldratiodiff:year1891 = 0
+## oldratiodiff:TypeMixed = 0
+## oldratiodiff:TypeUrban = 0
+## oldratiodiff:TypeRural = 0
+## outratiodiff:year1891 = 0
+## outratiodiff:TypeUrban = 0
+## outratiodiff:TypeRural = 0
+## outratiodiff:TypeMixed = 0
+## 
+## Model 1: restricted model
+## Model 2: paupratiodiff ~ (outratiodiff + popratiodiff + oldratiodiff) * 
+##     (year + Type)
+## 
+##   Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
+## 1   1172 402371                                  
+## 2   1160 373054 12     29318 7.5969 1.134e-13 ***
+## ---
+## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
+
+
+

b) The coefficients on outratiodiff in M4 are the same across years

+
linearHypothesis(M4, c("outratiodiff:year1891"))  
+
## Linear hypothesis test
+## 
+## Hypothesis:
+## outratiodiff:year1891 = 0
+## 
+## Model 1: restricted model
+## Model 2: paupratiodiff ~ (outratiodiff + popratiodiff + oldratiodiff) * 
+##     (year + Type)
+## 
+##   Res.Df    RSS Df Sum of Sq      F   Pr(>F)   
+## 1   1161 375552                                
+## 2   1160 373054  1    2498.8 7.7698 0.005399 **
+## ---
+## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
+
+
+

c) The coefficients on outratiodiff in M4 are the same across PLU Types

+
linearHypothesis(M4, c("outratiodiff:TypeUrban", "outratiodiff:TypeRural", "outratiodiff:TypeMixed"))
+
## Linear hypothesis test
+## 
+## Hypothesis:
+## outratiodiff:TypeUrban = 0
+## outratiodiff:TypeRural = 0
+## outratiodiff:TypeMixed = 0
+## 
+## Model 1: restricted model
+## Model 2: paupratiodiff ~ (outratiodiff + popratiodiff + oldratiodiff) * 
+##     (year + Type)
+## 
+##   Res.Df    RSS Df Sum of Sq      F   Pr(>F)    
+## 1   1163 380862                                 
+## 2   1160 373054  3    7808.7 8.0937 2.45e-05 ***
+## ---
+## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
+
+
+

d) The coefficients on outratiodiff in M4 are the same across PLU Types and years.

+
linearHypothesis(M4, c("outratiodiff:year1891","outratiodiff:TypeUrban", "outratiodiff:TypeRural", "outratiodiff:TypeMixed"))
+
## Linear hypothesis test
+## 
+## Hypothesis:
+## outratiodiff:year1891 = 0
+## outratiodiff:TypeUrban = 0
+## outratiodiff:TypeRural = 0
+## outratiodiff:TypeMixed = 0
+## 
+## Model 1: restricted model
+## Model 2: paupratiodiff ~ (outratiodiff + popratiodiff + oldratiodiff) * 
+##     (year + Type)
+## 
+##   Res.Df    RSS Df Sum of Sq      F   Pr(>F)    
+## 1   1164 384696                                 
+## 2   1160 373054  4     11642 9.0502 3.37e-07 ***
+## ---
+## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
+
+
+

6.Calculate the predicted value and confidence interval for the PLU with the median value of outratiodiff, popratiodiff, and oldratiodiff in each year and PLU Type for these models. Plot the predicted value and confidence interval of these as point-ranges.

+
data1 <- pauperism %>%
+  group_by(year, Type) %>%
+  filter (!is.na(Type), year %in% c("1881", "1891")) %>%
+  summarise_at (vars(outratiodiff, popratiodiff, oldratiodiff), median, na.rm = TRUE)
+
+M1_new <- tidy(predict(M1, newdata = data1, interval = "confidence", level = 0.95))
+
+dataM1 <- bind_cols (data1, M1_new)
+
+ggplot (data = dataM1) +
+  facet_grid(year ~ .) +
+  geom_pointrange(aes(x = Type, y = fit, ymin = lwr, ymax = upr))
+

+
M2_new <- tidy(predict(M2, newdata = data1, interval = "confidence", level = 0.95))
+
+dataM2 <- bind_cols(data1, M2_new)
+
+ggplot (data = dataM2) +
+  facet_grid(year ~ .) +
+  geom_pointrange(aes(x = Type, y = fit, ymin = lwr, ymax = upr))
+

+
M3_new <- tidy(predict(M3, newdata = data1, interval = "confidence", level = 0.95))
+
+dataM3 <- bind_cols(data1, M3_new)
+
+ggplot (data = dataM3) +
+  facet_grid(year ~ .) +
+  geom_pointrange(aes(x = Type, y = fit, ymin = lwr, ymax = upr))
+

+
M4_new <- tidy(predict(M4, newdata = data1, interval = "confidence", level = 0.95))
+
+dataM4 <- bind_cols (data1, M4_new)
+
+ggplot (data = dataM4) +
+  facet_grid(year ~ .) +
+  geom_pointrange(aes(x = Type, y = fit, ymin = lwr, ymax = upr))
+

+
+
+

7.As previously, calculate the predicted value of the median PLU in each year and PLU Type. But instead of confidence intervals include the prediction interval. How do the confidence and prediction intervals differ? What are their definitions?

+
M1_new <- tidy(predict(M1, newdata = data1, interval = "prediction", level = 0.95))
+
+dataM1 <- bind_cols (data1, M1_new)
+
+ggplot (data = dataM1) +
+  facet_grid(year ~ .) +
+  geom_pointrange(aes(x = Type, y = fit, ymin = lwr, ymax = upr))
+

+
M2_new <- tidy(predict(M2, newdata = data1, interval = "prediction", level = 0.95))
+
+dataM2 <- bind_cols(data1, M2_new)
+
+ggplot (data = dataM2) +
+  facet_grid(year ~ .) +
+  geom_pointrange(aes(x = Type, y = fit, ymin = lwr, ymax = upr))
+

+
M3_new <- tidy(predict(M3, newdata = data1, interval = "prediction", level = 0.95))
+
+dataM3 <- bind_cols(data1, M3_new)
+
+ggplot (data = dataM3) +
+  facet_grid(year ~ .) +
+  geom_pointrange(aes(x = Type, y = fit, ymin = lwr, ymax = upr))
+

+
M4_new <- tidy(predict(M4, newdata = data1, interval = "prediction", level = 0.95))
+
+dataM4 <- bind_cols (data1, M4_new)
+
+ggplot (data = dataM4) +
+  facet_grid(year ~ .) +
+  geom_pointrange(aes(x = Type, y = fit, ymin = lwr, ymax = upr))
+

+

A prediction interval is an interval of the predicted values, which extimates the true value of a y that has not been observed. It will contain the future values of a y, 95% of the time. A confidence interval contains the parameter estimate around the mean. It will contain the true population mean, 95% of the time.

+

The difference between the confidence intervals and the prediction intervals: the prediction intervals are much wider.

+
+
+
+

Functional Forms

+

The regression line of the model estimated in @Yule1899a (ignoring the year and region terms and interactions) can be also written as \[ +\begin{aligned}[t] +100 \times \frac{\mathtt{pauper2}_t / \mathtt{Popn2_t}}{\mathtt{pauper2}_{t-1} / \mathtt{Popn2_{t-1}}} +&= \beta_0 + \beta_1 \times 100 \times \frac{\mathtt{outratio}_t}{\mathtt{outratio_{t-1}}} \\ +& \quad + \beta_2 \times 100 \times \frac{\mathtt{Popn65}_t / \mathtt{Popn2}_{t}}{\mathtt{Popn65}_{t-1} / \mathtt{Popn2}_{t-1}} + \beta_3 \times 100 \times \frac{\mathtt{Popn2}_t}{\mathtt{Popn2}_{t - 1}} +\end{aligned} +\]

+
    +
  1. Take the logarithm of each side, and simplify so that \(\log(\mathtt{pauper2}_t/\mathtt{pauper2}_{t -1})\) is the outcome and the predictors are all in the form \(\log(x_t) - \log(x_{t - 1}) = \log(x_t / x_{t - 1})\) $$ (t/{t-1}) = _0 + _1(())\ - () + _2(())\ - () + () - () + _3() - ()
  2. +
+

$$

+
    +
  1. Estimate the model with logged difference predictors, Year, and month and interpret the coefficient on \(\log(outratio_t)\) .
  2. +
+
tidy(lm (log(pauper2) ~ (log (outratio) + log(Popn2) + log(Prop65)) * year + Type, data = pauperism))
+
##                      term     estimate  std.error   statistic      p.value
+## 1             (Intercept)  0.532584892 0.25222351   2.1115593 3.486505e-02
+## 2           log(outratio)  0.207982715 0.02860373   7.2711757 5.339890e-13
+## 3              log(Popn2) -0.076986692 0.02492738  -3.0884385 2.043590e-03
+## 4             log(Prop65)  0.898523115 0.06963985  12.9024273 1.861463e-36
+## 5                year1881 -1.172861822 0.26312476  -4.4574361 8.818390e-06
+## 6                year1891 -1.082628536 0.25949596  -4.1720438 3.166064e-05
+## 7               TypeMixed -0.594474654 0.04903643 -12.1231221 1.522963e-32
+## 8               TypeRural -0.694681619 0.05556267 -12.5026692 2.004940e-34
+## 9               TypeUrban -0.460494803 0.04371829 -10.5332288 3.359133e-25
+## 10 log(outratio):year1881  0.061980549 0.03537396   1.7521518 7.992204e-02
+## 11 log(outratio):year1891  0.004959228 0.03410923   0.1453925 8.844176e-01
+## 12    log(Popn2):year1881  0.030172996 0.03313236   0.9106806 3.625885e-01
+## 13    log(Popn2):year1891  0.017453919 0.03271408   0.5335292 5.937347e-01
+## 14   log(Prop65):year1881 -0.150544910 0.09327188  -1.6140439 1.066974e-01
+## 15   log(Prop65):year1891 -0.132271627 0.09508523  -1.3910848 1.643757e-01
+

In this logged model we can see that on average, ceteris paribus, a 1% change in the outratiodiff will have a 0.21% change on pauperism. In order to find \(\log(outratio_t)\) we need to look at the covariance between outratio and year. Here we find a 6% increase on the effect of outratio per year, therefore a 0.21 to 0.27 increase.

+
    +
  1. What are the pros and cons of this parameterization of the model relative to the one in @Yule1899a? Focus on interpretation and the desired goal of the inference rather than the formal tests of the regression. Can you think of other, better functional forms?
  2. +
+

When we log variables they are better to interpret since they are now based on percentage change rather than ratio differences. Over time this becomes easier to interpret as well as less labor intensive. Especially for a project such as Yule’s where seeing the percentage changes over time would make more sense than ratio differences. Yule could see the percent change from 1871 to 1881, 1881 to 1891. Another functional form one could use is a weighted regression. We can weight each statistic by the population to control for size.

+
+
+

Non-differenced Model

+

Suppose you estimate the model (M5) without differencing,

+
M5 <- tidy(lm(pauper2 ~ outratio + (Popn2 + Prop65) * (year + Type), data = pauperism))
+tidy(M5)
+
## Warning: NAs introduced by coercion
+
## Warning in FUN(newX[, i], ...): no non-missing arguments to min; returning
+## Inf
+
## Warning in FUN(newX[, i], ...): no non-missing arguments to max; returning
+## -Inf
+
##      column  n        mean        sd       median     trimmed        mad
+## 1     term* 19         NaN        NA           NA         NaN         NA
+## 2  estimate 19 0.034123738 0.2133329 6.617932e-08  0.03462656 0.03772393
+## 3 std.error 19 0.066373662 0.1100641 1.078133e-02  0.05801804 0.01598437
+## 4 statistic 19 0.002208183 3.8717500 3.118774e-01 -0.25821208 3.51922885
+## 5   p.value 19 0.166782878 0.2412675 2.181532e-02  0.14198258 0.03234340
+##             min        max      range      skew    kurtosis         se
+## 1           Inf       -Inf       -Inf        NA          NA         NA
+## 2 -4.283603e-01  0.4880598  0.9164200 0.2347724  0.35159431 0.04894192
+## 3  2.231202e-08  0.2747928  0.2747928 1.2238901 -0.41769516 0.02525043
+## 4 -6.944277e+00 11.3758381 18.3201153 0.9026189  1.71319524 0.88824036
+## 5  5.564443e-29  0.7551707  0.7551707 1.1752498 -0.04556811 0.05535055
+
tidy(M2)
+
##                      term     estimate   std.error  statistic      p.value
+## 1             (Intercept) -55.34359311 35.37428433 -1.5645148 1.179685e-01
+## 2            outratiodiff   0.23257691  0.01437241 16.1821800 2.796381e-53
+## 3            popratiodiff  -0.15460467  0.09242455 -1.6727663 9.464203e-02
+## 4            oldratiodiff   1.27422786  0.29222562  4.3604248 1.412925e-05
+## 5                year1891  15.81459463 16.47984545  0.9596325 3.374395e-01
+## 6               TypeMixed  64.41009425 37.21878051  1.7305805 8.379165e-02
+## 7               TypeRural  31.75172916 41.71570256  0.7611457 4.467242e-01
+## 8               TypeUrban 109.07760812 37.55456303  2.9045101 3.747781e-03
+## 9   popratiodiff:year1891  -0.23715938  0.05852434 -4.0523207 5.407671e-05
+## 10 popratiodiff:TypeMixed   0.24809583  0.11338910  2.1880043 2.886757e-02
+## 11 popratiodiff:TypeRural   0.43310791  0.16858281  2.5691108 1.031968e-02
+## 12 popratiodiff:TypeUrban   0.04596014  0.09636220  0.4769520 6.334858e-01
+## 13  oldratiodiff:year1891   0.20294088  0.12744222  1.5924149 1.115630e-01
+## 14 oldratiodiff:TypeMixed  -0.95692127  0.29917492 -3.1985344 1.418526e-03
+## 15 oldratiodiff:TypeRural  -0.83827902  0.31010130 -2.7032425 6.966637e-03
+## 16 oldratiodiff:TypeUrban  -1.11662635  0.31639892 -3.5291725 4.331264e-04
+ +

In the nondifferenced model the coefficient is much smaller (0.0012) than model 2 which is 0.23. The non differenced model has no baseline value unlike outratiodiff (which can use the initial year, 1871, as a comparison point). Therefore, we are unable to see the true effect of outratio on pauperism with a non differenced model since we are comparing outratio and paupratio directly, without seeing the changes over time from 1871 to 1881, 1881 to 1891. Model 5 is a difference-in-differences model while model 2 is a before-and-after design.

+ +

When the difference is not taken into account, you can use the data from 1871. This explains why M5 is has a bigger sample size since the year 1871 is taken into account.

+ +

M2 will have less biased estimates because of the comparing of different terms. This means changes inside a muncipality are measured with regards to pauperism. Instead of looking at out relief pauperism by itself. With the differing you are comparing from one year to another, you have a baseline each time to compare to.

+
+
+

Substantive Effects

+

Read @Gross2014a and @McCaskeyRainey2015a. Use the methods described in those papers to assess the substantive effects of out-ratio on the rate of pauperism. Use the model(s) of your choosing.

+

McCaskey and Rainy (2015): Suggestions for Substantive Researchers For the researcher making claims of substantive significance, we suggest the following strategy: 1. Compute 90% confidence intervals around the estimated effects. 2. Interpret each endpoint of the interval. 3. Claim that the effect is substantively meaningful if and only if all effects in the confidence interval are substantively meaningful.

+
# understanding the magnitude of the effect 
+
+tstar <- 1.645
+
+M4_ci <- tidy(M4) %>%
+  mutate (upr = (estimate + (std.error * tstar)), lwr = (estimate - (std.error * tstar)))
+
+tidy(M4_ci)
+
## Warning: NAs introduced by coercion
+
## Warning in FUN(newX[, i], ...): no non-missing arguments to min; returning
+## Inf
+
## Warning in FUN(newX[, i], ...): no non-missing arguments to max; returning
+## -Inf
+
##      column  n         mean         sd      median    trimmed       mad
+## 1     term* 20          NaN         NA          NA        NaN        NA
+## 2  estimate 20   1.76399914 15.2326789 -0.08882782  0.4590258 0.4887553
+## 3 std.error 20   9.04223179 16.5778174  0.14673840  6.0294163 0.1507126
+## 4 statistic 20  -0.08260028  3.0272100 -0.19739114 -0.2793311 2.7507670
+## 5   p.value 20   0.23474321  0.3055838  0.09144136  0.1847654 0.1355041
+## 6       upr 20  16.63847043 31.9529777  0.43500310  9.5235824 0.7031488
+## 7       lwr 20 -13.11047215 30.5030756 -0.34683071 -5.0189513 0.5674414
+##             min         max       range       skew   kurtosis         se
+## 1           Inf        -Inf        -Inf         NA         NA         NA
+## 2 -3.960588e+01  45.0723637  84.6782429  0.3997320  3.4340538 3.40613055
+## 3  3.178236e-02  44.0449941  44.0132118  1.2779185 -0.2582121 3.70691267
+## 4 -4.502796e+00   8.6129042  13.1156999  0.8410447  1.1131842 0.67690474
+## 5  2.291745e-17   0.8888402   0.8888402  1.0912734 -0.3380284 0.06833062
+## 6 -1.668551e-01 111.2652104 111.4320655  1.6429652  1.5503020 7.14490301
+## 7 -1.120599e+02   0.9371370 112.9970315 -2.1224544  3.4010009 6.82069506
+

Now that we have found the 90% confidence intervals for the chosen model (M4). We can see that outratiodiff has an influence between 0.43 and 0.63. Therefore, holding all else constant, on average each one unit increase on outratiodiff has an influence in the range of a 0.43 and 0.63 increase on pauperism. This effect is substantively meaningful due to the potential 43% or 63% growth in pauperism.

+
+
+

Influential Observations and Outliers

+
+

Influential Observations for the Regression

+

For this use M2:

+
    +
  1. For each observation, calculate and explain the following:
  2. +
+
    +
  • hat value (hatvalues)
  • +
  • standardized error (rstandard)
  • +
  • studentized error (rstudent)
  • +
  • Cook’s distance (cooksd)
  • +
+
hatvalues <- hatvalues(M2) # Gives hat scores, where res is the result from lm()
+hatscore <- hatvalues(M2)/mean(hatvalues(M2)) # Gives standardized hat scores
+head (hatscore)
+
##        2        3        5        6        8        9 
+## 2.256707 5.372202 9.182804 6.406825 3.945256 1.418897
+
# standardized error (rstandard)
+rsta <- rstandard(M2) #Gives standardized residuals
+head (rsta)
+
##           2           3           5           6           8           9 
+## -1.70910056  0.28214127 -0.05082361  1.23807840 -1.70039734 -0.26657462
+
# studentized error (rstudent)
+rstu <- rstudent(M2) #Gives studentized residuals
+head (rstu)
+
##           2           3           5           6           8           9 
+## -1.71051384  0.28202969 -0.05080184  1.23836212 -1.70178168 -0.26646822
+
# Cook’s distance (cooksd)
+cooks <- cooks.distance(M2) #Gives Cook's Distance
+head (cooks)
+
##            2            3            5            6            8 
+## 5.762689e-03 3.908863e-04 2.296014e-05 9.114347e-03 1.021342e-02 
+##            9 
+## 8.712515e-05
+
# alternative:
+
+M2_aug <- augment(M2) %>%
+  mutate (.student.resid = .resid / .sigma * sqrt (1 - .hat))
+
+glimpse (M2_aug)
+
## Observations: 1,180
+## Variables: 15
+## $ .rownames      <chr> "2", "3", "5", "6", "8", "9", "11", "12", "14",...
+## $ paupratiodiff  <dbl> 27.02455, 126.21656, 30.55314, 81.81045, 47.132...
+## $ outratiodiff   <dbl> 4.709511, 105.770522, 20.691155, 40.996818, 12....
+## $ popratiodiff   <dbl> 135.62124, 101.93502, 173.89046, 164.47200, 110...
+## $ oldratiodiff   <dbl> 104.24513, 119.24258, 85.41113, 104.74832, 115....
+## $ year           <chr> "1881", "1891", "1881", "1891", "1881", "1891",...
+## $ Type           <chr> "Metropolitan", "Metropolitan", "Metropolitan",...
+## $ .fitted        <dbl> 57.61610, 121.27772, 31.41765, 60.30265, 77.206...
+## $ .se.fit        <dbl> 3.180086, 4.906563, 6.414884, 5.358245, 4.20473...
+## $ .resid         <dbl> -30.5915473, 4.9388445, -0.8645153, 21.5078065,...
+## $ .hat           <dbl> 0.03059942, 0.07284342, 0.12451260, 0.08687220,...
+## $ .sigma         <dbl> 18.16449, 18.18670, 18.18730, 18.17535, 18.1647...
+## $ .cooksd        <dbl> 5.762689e-03, 3.908863e-04, 2.296014e-05, 9.114...
+## $ .std.resid     <dbl> -1.70910056, 0.28214127, -0.05082361, 1.2380784...
+## $ .student.resid <dbl> -1.65817311, 0.26148569, -0.04447637, 1.1307828...
+
    +
  1. Create an outlier plot and label any outliers. See the example here
  2. +
+
plot(hatscore,rstu, xlab="Standardized hat-values", ylab="Studentized Residuals",
+main="Influence Plot") +
+abline(h=c(-2,2), lty=2) +
+abline(v=c(2,3), lty=c(2,3))
+

+
## numeric(0)
+
    +
  1. Using the plot and rules of thumb identify outliers and influential observations
  2. +
+

Looking at the plot we can say that any values which are outside of the box drawn from -2 to 2 are outliers. They are both high on leverage and discrepency. The standardized hat values which are more than 2 or 3 (past one or both the vertical lines) would also be outliers.

+
+
+
+

Influential Observations for a Coefficient

+
    +
  1. Run M2, deleting each observation and saving the coefficient for outratiodirff. This is a method called the jackknife. You can use a for loop to do this, or you can use the function jackknife in the package resamplr.
  2. +
+
jackknifeCOEF <- matrix(NA, ncol= 16, nrow= nrow(pauperism))
+
+tidy_M2 <- tidy (M2)
+
+colnames(jackknifeCOEF) <- c (tidy_M2$term)
+
+for (i in 1:nrow(pauperism)){
+jackknifeCOEF[i,] <- coef((lm(paupratiodiff ~ outratiodiff + (popratiodiff + oldratiodiff) * 
+           (year + Type), data = pauperism[-i,])))
+}
+
+head(jackknifeCOEF)
+
##      (Intercept) outratiodiff popratiodiff oldratiodiff year1891 TypeMixed
+## [1,]   -55.34359    0.2325769   -0.1546047     1.274228 15.81459  64.41009
+## [2,]   -60.03163    0.2317537   -0.1365587     1.305022 17.11302  68.36327
+## [3,]   -53.53339    0.2325645   -0.1549528     1.256332 15.94401  62.53052
+## [4,]   -55.34359    0.2325769   -0.1546047     1.274228 15.81459  64.41009
+## [5,]   -55.15810    0.2325761   -0.1536138     1.271479 15.82996  64.21723
+## [6,]   -50.00969    0.2332978   -0.1790220     1.246527 16.61523  58.50446
+##      TypeRural TypeUrban popratiodiff:year1891 popratiodiff:TypeMixed
+## [1,]  31.75173  109.0776            -0.2371594              0.2480958
+## [2,]  35.81002  112.8970            -0.2419905              0.2327206
+## [3,]  29.85320  107.2313            -0.2367462              0.2483106
+## [4,]  31.75173  109.0776            -0.2371594              0.2480958
+## [5,]  31.56388  108.8799            -0.2372897              0.2471626
+## [6,]  26.07548  102.8957            -0.2443414              0.2758056
+##      popratiodiff:TypeRural popratiodiff:TypeUrban oldratiodiff:year1891
+## [1,]              0.4331079             0.04596014             0.2029409
+## [2,]              0.4175821             0.03022029             0.1951028
+## [3,]              0.4333823             0.04611562             0.2012715
+## [4,]              0.4331079             0.04596014             0.2029409
+## [5,]              0.4321577             0.04503012             0.2029273
+## [6,]              0.4604489             0.07372856             0.2018767
+##      oldratiodiff:TypeMixed oldratiodiff:TypeRural oldratiodiff:TypeUrban
+## [1,]             -0.9569213             -0.8382790              -1.116626
+## [2,]             -0.9824004             -0.8646263              -1.140453
+## [3,]             -0.9381813             -0.8194209              -1.098145
+## [4,]             -0.9569213             -0.8382790              -1.116626
+## [5,]             -0.9541600             -0.8355499              -1.113821
+## [6,]             -0.9273552             -0.8106171              -1.084341
+

For which observations is there the largest change in the coefficient on outratiodiff?

+
    +
  1. Which observations have the largest effect on the estimate of outratiodiff?
  2. +
+
summary (jackknifeCOEF)
+
##   (Intercept)      outratiodiff     popratiodiff      oldratiodiff  
+##  Min.   :-77.54   Min.   :0.2291   Min.   :-0.1930   Min.   :1.077  
+##  1st Qu.:-55.37   1st Qu.:0.2325   1st Qu.:-0.1547   1st Qu.:1.274  
+##  Median :-55.34   Median :0.2326   Median :-0.1546   Median :1.274  
+##  Mean   :-55.35   Mean   :0.2326   Mean   :-0.1546   Mean   :1.274  
+##  3rd Qu.:-55.32   3rd Qu.:0.2326   3rd Qu.:-0.1545   3rd Qu.:1.274  
+##  Max.   :-31.49   Max.   :0.2415   Max.   :-0.0951   Max.   :1.467  
+##     year1891        TypeMixed       TypeRural        TypeUrban     
+##  Min.   : 9.357   Min.   :39.23   Min.   : 6.742   Min.   : 83.81  
+##  1st Qu.:15.792   1st Qu.:64.39   1st Qu.:31.718   1st Qu.:109.06  
+##  Median :15.815   Median :64.41   Median :31.752   Median :109.08  
+##  Mean   :15.818   Mean   :64.41   Mean   :31.755   Mean   :109.09  
+##  3rd Qu.:15.859   3rd Qu.:64.43   3rd Qu.:31.777   3rd Qu.:109.09  
+##  Max.   :31.861   Max.   :85.56   Max.   :53.088   Max.   :130.13  
+##  popratiodiff:year1891 popratiodiff:TypeMixed popratiodiff:TypeRural
+##  Min.   :-0.3241       Min.   :0.1835         Min.   :0.3713        
+##  1st Qu.:-0.2373       1st Qu.:0.2480         1st Qu.:0.4330        
+##  Median :-0.2372       Median :0.2481         Median :0.4331        
+##  Mean   :-0.2372       Mean   :0.2481         Mean   :0.4331        
+##  3rd Qu.:-0.2370       3rd Qu.:0.2481         3rd Qu.:0.4332        
+##  Max.   :-0.1647       Max.   :0.2901         Max.   :0.4746        
+##  popratiodiff:TypeUrban oldratiodiff:year1891 oldratiodiff:TypeMixed
+##  Min.   :-0.02761       Min.   :0.08927       Min.   :-1.1426       
+##  1st Qu.: 0.04594       1st Qu.:0.20255       1st Qu.:-0.9571       
+##  Median : 0.04596       Median :0.20294       Median :-0.9569       
+##  Mean   : 0.04593       Mean   :0.20292       Mean   :-0.9569       
+##  3rd Qu.: 0.04599       3rd Qu.:0.20317       3rd Qu.:-0.9567       
+##  Max.   : 0.10453       Max.   :0.25034       Max.   :-0.7492       
+##  oldratiodiff:TypeRural oldratiodiff:TypeUrban
+##  Min.   :-1.0252        Min.   :-1.3010       
+##  1st Qu.:-0.8384        1st Qu.:-1.1167       
+##  Median :-0.8383        Median :-1.1166       
+##  Mean   :-0.8383        Mean   :-1.1167       
+##  3rd Qu.:-0.8380        3rd Qu.:-1.1165       
+##  Max.   :-0.6318        Max.   :-0.9075
+
jackknife1 <- as.data.frame(jackknifeCOEF) 
+jackknife1$.rownames <- as.character(seq.int(nrow(jackknife1)))
+jack <- jackknife1 %>%
+  select (.rownames, outratiodiff) %>%
+  filter (outratiodiff > 0.235) %>%
+  print()
+
##   .rownames outratiodiff
+## 1        93    0.2353312
+## 2       242    0.2400089
+## 3       321    0.2414890
+## 4       456    0.2354546
+## 5      1047    0.2357447
+## 6      1134    0.2364103
+## 7      1413    0.2371184
+## 8      1620    0.2389753
+

After filtering out the observations which are more than 0.235, we can see there are 8 observations that have a large effect on outratiodiff.

+
    +
  1. How do these observations compare with those that had the largest effect on the overall regression as measured with Cook’s distance?
  2. +
+
M2_aug %>%
+  filter (.cooksd > 0.35)
+
##   .rownames paupratiodiff outratiodiff popratiodiff oldratiodiff year
+## 1      1410      23.39941     83.12100    379.74075     109.7245 1891
+## 2      1413     278.60457     58.87793     35.73522     122.9814 1891
+##    Type    .fitted   .se.fit    .resid       .hat   .sigma   .cooksd
+## 1 Urban  -2.874848 13.472445  26.27426 0.54919739 18.15109 0.3528030
+## 2 Urban 115.224894  4.993271 163.37967 0.07544072 17.49155 0.4455003
+##   .std.resid .student.resid
+## 1   2.152559      0.9718988
+## 2   9.346488      8.9812582
+

The observation from row 1413 shows up in both the jackknife and cooks distance calculations. However, only 1410 is visible in the cooks distance calculation.

+
    +
  1. Compare the results of the jackknife to the dfbeta statistic for outratiodiff
  2. +
+
dfbeta <- tidy(dfbetas (M2))
+
+dfbeta %>%
+  filter (abs(outratiodiff) > 2/sqrt (nrow(dfbeta))) %>%
+  inner_join (jack, by = ".rownames") %>%
+  print 
+
##   .rownames X.Intercept. outratiodiff.x popratiodiff oldratiodiff
+## 1        93   0.56743376     -0.1920594 -0.288357946  -0.58185633
+## 2       242  -0.08516041     -0.5197752  0.048325627   0.09392898
+## 3       321  -0.04612685     -0.6205123  0.025062100   0.05713621
+## 4       456  -0.02003577     -0.2003758  0.008077134   0.02511958
+## 5      1047  -0.02126329     -0.2207569  0.004198310   0.02866121
+## 6      1134  -0.02580265     -0.2672048  0.015376390   0.03076645
+## 7      1413  -0.29264457     -0.3284116  0.371965630   0.22264136
+## 8      1620  -0.04961977     -0.4453968  0.029293890   0.05677375
+##       year1891   TypeMixed    TypeRural   TypeUrban popratiodiff.year1891
+## 1  0.136496813 -0.56384874 -0.501126078 -0.55775059          -0.054926491
+## 2  0.271916012 -0.06403943  0.035639267  0.04696809          -0.031194292
+## 3 -0.003052264  0.04116431  0.055215626  0.05888512          -0.009965884
+## 4  0.023242039  0.01823860  0.035396621  0.01819242           0.004779178
+## 5  0.032838882  0.01669058  0.004075248  0.01848190           0.026178107
+## 6 -0.006831864  0.03484534  0.026993426  0.02696833          -0.023254825
+## 7  0.407249707  0.20134869  0.248061988 -0.17891278          -1.286102999
+## 8  0.086022476  0.04276304  0.093888252  0.04222417          -0.024381113
+##   popratiodiff.TypeMixed popratiodiff.TypeRural popratiodiff.TypeUrban
+## 1            0.254384432            0.166506078            0.292960278
+## 2           -0.025457254           -0.021688023           -0.035494141
+## 3            0.014026729           -0.011969316           -0.018171955
+## 4           -0.002729445           -0.021453301           -0.008013993
+## 5           -0.002880387            0.046260586           -0.010191113
+## 6           -0.016022363           -0.001986061           -0.006597184
+## 7           -0.008931584           -0.065739497           -0.631677973
+## 8           -0.009480686           -0.063667800           -0.019169602
+##   oldratiodiff.year1891 oldratiodiff.TypeMixed oldratiodiff.TypeRural
+## 1           -0.13605297             0.59741113             0.57639335
+## 2           -0.29860619             0.09118345            -0.02595994
+## 3            0.03091964            -0.04765164            -0.05433189
+## 4           -0.02576518            -0.01770934            -0.03306525
+## 5           -0.04821125            -0.01551577            -0.02744388
+## 6            0.02529348            -0.03434564            -0.02920560
+## 7            0.11043376            -0.23031013            -0.27904505
+## 8           -0.07750413            -0.04061349            -0.08217588
+##   oldratiodiff.TypeUrban outratiodiff.y
+## 1             0.56508606      0.2353312
+## 2            -0.03667247      0.2400089
+## 3            -0.05514832      0.2414890
+## 4            -0.01614687      0.2354546
+## 5            -0.01559710      0.2357447
+## 6            -0.02590141      0.2364103
+## 7             0.47280266      0.2371184
+## 8            -0.03744212      0.2389753
+

When we compare the dfbeta statistics to the jackknife we see that the exact same rows appear from both. They correspond to each other.

+

2.@AronowSamii2015a note that the influence of observations in a regression coefficient is different than the the influence of regression observations in the entire regression. Calculate the observation weights for outratiodiff.

+
    +
  1. Regress outratiodiff on the control variables.
  2. +
+
M2_out <- lm (outratiodiff ~ (popratiodiff + oldratiodiff) * (year + Type), data = pauperism)
+
+M2_out
+
## 
+## Call:
+## lm(formula = outratiodiff ~ (popratiodiff + oldratiodiff) * (year + 
+##     Type), data = pauperism)
+## 
+## Coefficients:
+##            (Intercept)            popratiodiff            oldratiodiff  
+##              -204.4514                  0.3320                  1.9749  
+##               year1891               TypeMixed               TypeRural  
+##                91.6318                256.6422                259.3359  
+##              TypeUrban   popratiodiff:year1891  popratiodiff:TypeMixed  
+##               247.5671                 -0.1445                 -0.1605  
+## popratiodiff:TypeRural  popratiodiff:TypeUrban   oldratiodiff:year1891  
+##                -0.3584                 -0.2358                 -0.3333  
+## oldratiodiff:TypeMixed  oldratiodiff:TypeRural  oldratiodiff:TypeUrban  
+##                -1.9846                 -1.7723                 -1.9101
+
    +
  1. The weights of the observations are those with the highest squared errors from this regression. Which observations have the highest coefficient values?
  2. +
+
summary(resid (M2_out))
+
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
+## -103.400  -21.690   -4.504    0.000   16.320  452.300
+
M2_resid <- tidy (resid(M2_out)) %>%
+  mutate (x = abs(x)) %>%
+  top_n (x, n = 25)
+
+head(M2_resid) # printing only 6 rows but there are 25 values.
+
## # A tibble: 6 × 2
+##   names         x
+##   <chr>     <dbl>
+## 1    81 124.83695
+## 2    93  93.65999
+## 3   242 177.80260
+## 4   321 452.27249
+## 5   372  97.06956
+## 6   390  82.75111
+

With the resid function we can see the top 25 rows with the absolute highest values from the M2_out regression.

+
    +
  1. How do the observations with the highest regression weights compare with those with the highest changes in the regression coefficient from the jackknife?
  2. +
+
M2_resid %>%
+  rename (.rownames = names) %>%
+  inner_join (jack, by = ".rownames")
+
## # A tibble: 7 × 3
+##   .rownames         x outratiodiff
+##       <chr>     <dbl>        <dbl>
+## 1        93  93.65999    0.2353312
+## 2       242 177.80260    0.2400089
+## 3       321 452.27249    0.2414890
+## 4       456 151.54252    0.2354546
+## 5      1047 127.49436    0.2357447
+## 6      1134 146.44627    0.2364103
+## 7      1620 367.86213    0.2389753
+
+
+

Omitted Variable Bias

+

An informal way to assess the potential impact of omitted variables on the coeficient of the variable of interest is to coefficient variation when covariates are added as a measure of the potential for omitted variable bias [@Oster2016a].

+

@NunnWantchekon2011a (Table 4) calculate a simple statistic for omitted variable bias in OLS. This statistic “provide[s] a measure to gauge the strength of the likely bias arising from unobservables: how much stronger selection on unobservables, relative to selection on observables, must be to explain away the full estimated effect.”

+
    +
  1. Run a regression without any controls. Denote the coefficient on the variable of interest as \(\hat\beta_R\).
  2. +
+
beta_hat_R <- tidy (lm(paupratiodiff ~ outratiodiff, data = pauperism)) %>%
+  filter (term == "outratiodiff") %>%
+  select (estimate)
+
+beta_hat_R <- as.numeric(beta_hat_R)
+beta_hat_R
+
## [1] 0.3062392
+
    +
  1. Run a regression with the full set of controls. Denote the coefficient on the variable of interest in this regression as \(\hat\beta_F\).
  2. +
+
beta_hat_F <- tidy_M2 %>%
+  filter (term == "outratiodiff") %>%
+  select (estimate)
+
+beta_hat_F <- as.numeric (beta_hat_F)
+beta_hat_F
+
## [1] 0.2325769
+
    +
  1. The ratio is \[\hat\beta_F / (\hat\beta_R - \hat\beta_F)\]
  2. +
+
beta_hat = beta_hat_F / (beta_hat_R - beta_hat_F)
+beta_hat
+
## [1] 3.157342
+

Calculate this statistic for M2 and interpret it.

+

The ratio is 3.157. According to the Nunn and Wantchekon 2011 article, the selection on unobservables must be more than 3 times stronger than the selection on observables to explain away the entire effect (Pg. 3238).

+
+
+

Heteroskedasticity

+
    +
  1. Run M2 and M3 with a heteroskedasticity consistent (HAC), also called robust, standard error. How does this affect the standard errors on outratio coefficients? Use the sandwich package to add HAC standard errors [@Zeileis2004a].Compare SE to the HAC.
  2. +
+
HAC_M2 <- vcovHAC(M2) #Gives the estimated HAC covariance matrix
+diag_M2 <- sqrt(diag(vcovHAC(M2))) #Gives the HAC standard errors
+
+diag_M2[2] - tidy_M2$std.error[2]
+
## outratiodiff 
+##  0.005028073
+
HAC_M3 <- vcovHAC(M3)
+diag_M3 <- sqrt(diag(vcovHAC(M3)))
+tidy_M3 <- tidy (M3)
+
+diag_M3[2] - tidy_M3$std.error[2]
+
## outratiodiff 
+##    0.0132523
+

Using the HAC calculation we can see an increase in the standard error from the original M2 regression by around 0.005.

+

Using the HAC calculation we can see an increase in the standard error from the original M3 regression by around 0.0132.

+
+

Multiple Regressions

+
    +
  1. Run the model with interactions for all years and types
  2. +
+
tidy(lm(pauper2 ~ (outratio + Popn2 + Prop65) * year * Type - 1, data = pauperism))
+
##                           term      estimate    std.error     statistic
+## 1                     outratio  5.708958e-04 8.005294e-04  0.7131479210
+## 2                        Popn2 -5.025627e-08 4.494016e-08 -1.1182931580
+## 3                       Prop65 -1.801554e-01 5.149112e-01 -0.3498766749
+## 4                     year1871  5.955222e-02 2.020238e-02  2.9477822141
+## 5                     year1881 -9.485841e-06 1.912446e-02 -0.0004960056
+## 6                     year1891  3.542339e-02 1.771497e-02  1.9996298499
+## 7                    TypeMixed -4.853681e-02 2.097216e-02 -2.3143446585
+## 8                    TypeRural -6.334421e-02 2.153513e-02 -2.9414362704
+## 9                    TypeUrban -6.401214e-02 2.180260e-02 -2.9359859167
+## 10           outratio:year1881  5.516347e-03 3.450922e-03  1.5985140373
+## 11           outratio:year1891  2.855633e-03 2.984223e-03  0.9569101161
+## 12              Popn2:year1881 -2.762783e-08 5.891525e-08 -0.4689418597
+## 13              Popn2:year1891 -2.761539e-08 5.465754e-08 -0.5052438011
+## 14             Prop65:year1881  9.770247e-01 7.138906e-01  1.3685917152
+## 15             Prop65:year1891  1.360550e-01 6.667050e-01  0.2040707241
+## 16          outratio:TypeMixed  2.857434e-04 8.336068e-04  0.3427796522
+## 17          outratio:TypeRural  4.607440e-04 8.351230e-04  0.5517079504
+## 18          outratio:TypeUrban  9.951729e-04 8.819306e-04  1.1284027694
+## 19             Popn2:TypeMixed -3.793316e-07 8.310262e-08 -4.5646165868
+## 20             Popn2:TypeRural  6.734344e-08 8.570810e-08  0.7857302403
+## 21             Popn2:TypeUrban  9.575764e-08 5.425215e-08  1.7650477300
+## 22            Prop65:TypeMixed  9.930927e-01 5.211824e-01  1.9054609278
+## 23            Prop65:TypeRural  9.762405e-01 5.246110e-01  1.8608845521
+## 24            Prop65:TypeUrban  9.254490e-01 5.330645e-01  1.7360920998
+## 25          year1881:TypeMixed  4.835546e-02 2.893869e-02  1.6709623975
+## 26          year1891:TypeMixed  1.629334e-02 2.806250e-02  0.5806090462
+## 27          year1881:TypeRural  4.352313e-02 2.973288e-02  1.4638047731
+## 28          year1891:TypeRural  1.142985e-02 2.877682e-02  0.3971893005
+## 29          year1881:TypeUrban  7.359464e-02 3.000889e-02  2.4524276041
+## 30          year1891:TypeUrban  3.345950e-02 2.909343e-02  1.1500709107
+## 31 outratio:year1881:TypeMixed -4.749300e-03 3.477899e-03 -1.3655657294
+## 32 outratio:year1891:TypeMixed -2.577668e-03 3.013496e-03 -0.8553745772
+## 33 outratio:year1881:TypeRural -4.820566e-03 3.473224e-03 -1.3879224976
+## 34 outratio:year1891:TypeRural -2.546709e-03 3.007052e-03 -0.8469122618
+## 35 outratio:year1881:TypeUrban -5.802898e-03 3.529730e-03 -1.6440062308
+## 36 outratio:year1891:TypeUrban -3.560759e-03 3.066255e-03 -1.1612726322
+## 37    Popn2:year1881:TypeMixed  3.546083e-07 1.094249e-07  3.2406543554
+## 38    Popn2:year1891:TypeMixed  3.728497e-07 1.040752e-07  3.5825034936
+## 39    Popn2:year1881:TypeRural  9.482280e-08 1.142364e-07  0.8300578965
+## 40    Popn2:year1891:TypeRural  7.434405e-08 1.079794e-07  0.6885022733
+## 41    Popn2:year1881:TypeUrban -2.211217e-08 7.076765e-08 -0.3124615712
+## 42    Popn2:year1891:TypeUrban -3.374132e-08 6.613930e-08 -0.5101553026
+## 43   Prop65:year1881:TypeMixed -1.292371e+00 7.226732e-01 -1.7883203392
+## 44   Prop65:year1891:TypeMixed -5.667597e-01 6.757782e-01 -0.8386770918
+## 45   Prop65:year1881:TypeRural -1.112461e+00 7.276014e-01 -1.5289430356
+## 46   Prop65:year1891:TypeRural -3.870140e-01 6.800808e-01 -0.5690705995
+## 47   Prop65:year1881:TypeUrban -1.428906e+00 7.389726e-01 -1.9336386438
+## 48   Prop65:year1891:TypeUrban -5.094620e-01 6.922665e-01 -0.7359333429
+##         p.value
+## 1  4.758507e-01
+## 2  2.635975e-01
+## 3  7.264740e-01
+## 4  3.243628e-03
+## 5  9.996043e-01
+## 6  4.569683e-02
+## 7  2.076578e-02
+## 8  3.310491e-03
+## 9  3.368918e-03
+## 10 1.101118e-01
+## 11 3.387467e-01
+## 12 6.391705e-01
+## 13 6.134522e-01
+## 14 1.713051e-01
+## 15 8.383223e-01
+## 16 7.318060e-01
+## 17 5.812199e-01
+## 18 2.593068e-01
+## 19 5.357876e-06
+## 20 4.321335e-01
+## 21 7.773266e-02
+## 22 5.688629e-02
+## 23 6.293051e-02
+## 24 8.272610e-02
+## 25 9.491050e-02
+## 26 5.615798e-01
+## 27 1.434295e-01
+## 28 6.912770e-01
+## 29 1.428804e-02
+## 30 2.502741e-01
+## 31 1.722533e-01
+## 32 3.924626e-01
+## 33 1.653399e-01
+## 34 3.971615e-01
+## 35 1.003571e-01
+## 36 2.456917e-01
+## 37 1.215279e-03
+## 38 3.497614e-04
+## 39 4.066209e-01
+## 40 4.912291e-01
+## 41 7.547275e-01
+## 42 6.100079e-01
+## 43 7.389979e-02
+## 44 4.017667e-01
+## 45 1.264619e-01
+## 46 5.693823e-01
+## 47 5.332100e-02
+## 48 4.618713e-01
+
    +
  1. For each subset of year and type run the regression
  2. +
+
lm(pauper2 ~ outratio + Popn2 + Prop65, data = pauperism)
+
## 
+## Call:
+## lm(formula = pauper2 ~ outratio + Popn2 + Prop65, data = pauperism)
+## 
+## Coefficients:
+## (Intercept)     outratio        Popn2       Prop65  
+##   8.398e-03    1.956e-03   -1.475e-08    3.284e-01
+
    +
  1. Compare the coefficients, standard errors, and regression standard errors in these regresions.
  2. +
+
all_interact <-
+  crossing(Type = pauperism$Type, year = c(1881, 1891)) %>%
+  mutate(mod = map2(year, Type, 
+                    function(yr, ty) {
+                    lm(paupratiodiff ~ outratiodiff + popratiodiff + oldratiodiff,
+                       data = filter(pauperism,
+                                      year == yr,
+                                      Type == ty))
+                    })) %>%
+  mutate(mod_glance = map(mod, broom::glance),
+         mod_tidy = map(mod, broom::tidy))
+
+all_interact %>%
+  mutate (sigma = map_dbl(mod_glance, function (x) x$sigma)) %>%
+  mutate (std.error.out = map_dbl(mod_tidy, function (x) x$std.error[2])) %>%
+  mutate (estimate.out = map_dbl(mod_tidy, function (x) x$estimate[2])) %>%
+            select (year, Type, sigma, std.error.out, estimate.out)
+
## # A tibble: 8 × 5
+##    year         Type     sigma std.error.out estimate.out
+##   <dbl>        <chr>     <dbl>         <dbl>        <dbl>
+## 1  1881 Metropolitan  9.886436    0.13716458    0.7085593
+## 2  1891 Metropolitan 24.790240    0.11858387    0.3263541
+## 3  1881        Mixed 16.437527    0.04011237    0.1996762
+## 4  1891        Mixed 17.403411    0.02576831    0.1656616
+## 5  1881        Rural 13.801753    0.03316465    0.2308180
+## 6  1891        Rural 17.078948    0.02282518    0.2137457
+## 7  1881        Urban 19.523919    0.07970376    0.7107494
+## 8  1891        Urban 25.557318    0.07559362    0.2632868
+
+
+
+

Weighted Regression

+
    +
  1. Run M2 and M3 as weighted regressions, weighted by the population (Popn) and interpret the coefficients on outratiodiff and interactions. Informally assess the extent to which the coefficients are different. Which one does it seem to affect more?
  2. +
+
M2_weights <- lm(paupratiodiff ~ outratiodiff + (popratiodiff + oldratiodiff) * 
+           (year + Type), weights = Popn, data = pauperism)
+
+head(tidy(M2_weights))
+
##           term    estimate   std.error statistic      p.value
+## 1  (Intercept) -27.6860026 24.91557702 -1.111193 2.667150e-01
+## 2 outratiodiff   0.3644746  0.01907747 19.104979 5.472523e-71
+## 3 popratiodiff  -0.1960724  0.06021243 -3.256344 1.161100e-03
+## 4 oldratiodiff   1.0143631  0.20524209  4.942276 8.851788e-07
+## 5     year1891 -23.1753237 17.39122610 -1.332587 1.829281e-01
+## 6    TypeMixed  43.7123212 30.20646454  1.447118 1.481331e-01
+
M3_weights <- lm(-1 + paupratiodiff ~ (outratiodiff + popratiodiff + oldratiodiff) * 
+           (year + Type), weights = Popn, data = pauperism)
+
+head(tidy(M3_weights))
+
##           term    estimate   std.error  statistic      p.value
+## 1  (Intercept)  23.7603236 25.80279872  0.9208429 3.573239e-01
+## 2 outratiodiff   0.7167659  0.04453160 16.0956706 9.131931e-53
+## 3 popratiodiff  -0.2379611  0.05898587 -4.0342058 5.837204e-05
+## 4 oldratiodiff   0.3779429  0.22090876  1.7108552 8.737520e-02
+## 5     year1891   2.3389679 17.13905009  0.1364701 8.914733e-01
+## 6    TypeMixed -27.6445457 31.11298198 -0.8885213 3.744446e-01
+

Holding all else constant and with weights, on average in M2 a one unit change in outratiodiff has a 0.36 increase in paupratiodiff. Holding all else constant and with weights, on average in M3 a one unit change in outratiodiff has a 0.72 increase in paupratiodiff. The difference between the two is quite stark. Both M2 and M3 increase when compared to their original coefficients, without weights. However, between the two M3 is effected more (from 0.53 to 0.72).

+
    +
  1. What are some rationales for weighting by population? See the discussion in @SolonHaiderWooldridge2013a and @AngristPischke2014a.
  2. +
+

According to “Mastering Metrics” weighting weights each term in the residual sum of squares by population or some other weight. We are able to take into account the population of people in each district and make sure the larger districts count for more. In this way we are able to account for small districts having more variability on average.

+
+
+

Cross-Validation

+

When using regression for causal inference, model specification and choice should largely be based on avoiding omitted variables. Another criteria for selecting models is to use their fit to the data. But a model’s fit to data should not be assessed using only the in-sample data. That leads to overfitting—and the best model would always be to include an indicator variable for every observation Instead, a model’s fit to data can be assessed by using its out-of-sample fit. One way to estimate the expected fit of a model to new data is cross-validation.

+

We want to compare the predictive performance of the following models

+
mod_formulas <- 
+  list(
+    m0 = paupratiodiff ~ 1,
+    m1 = paupratiodiff ~ year + Type,    
+    m2 = paupratiodiff ~ outratiodiff + year + Type,
+    m3 = paupratiodiff ~ outratiodiff + (popratiodiff + oldratiodiff) * (year + Type),
+    m4 = -1  + paupratiodiff ~ (outratiodiff + popratiodiff + oldratiodiff) * (year + Type),
+    m5 = paupratiodiff ~ (outratiodiff + popratiodiff + oldratiodiff) * year * Type
+  )
+

Let’s split the data into 10 (train/test) folds for cross-validation,

+
pauperism_nonmiss <- 
+  pauperism %>%
+  filter(year %in% c(1881, 1891)) %>%
+  select(paupratiodiff, outratiodiff, popratiodiff, oldratiodiff, year, Type, Region, ID, BoothGroup) %>%
+  tidyr::drop_na()
+pauperism_10folds <-
+  pauperism_nonmiss %>%
+  resamplr::crossv_kfold(10)
+

For each model formula f, training data set train, and test data set, test, run the model specified by f on train, and predict new observations in test, and calculate the RMSE from the residuals

+
mod_rmse_fold <- function(f, train, test) {
+  fit <- lm(f, data = as.data.frame(train))
+  test_data <- as.data.frame(test)
+  err <- test_data$paupratiodiff - predict(fit, newdata = test_data)
+  sqrt(mean(err ^ 2))
+}
+

E.g. for one fold and formula,

+
mod_rmse_fold(mod_formulas[[1]], pauperism_10folds$train[[1]],
+              pauperism_10folds$test[[1]])
+
## [1] 24.16498
+

Now write a function that will calculate the average RMSE across folds for a formula and a cross-validation data frame with train and test list-columns:

+
mod_rmse <- function(f, data) {
+  map2_dbl(data$train, data$test, 
+           function(train, test) {
+             mod_rmse_fold(f, train, test)
+           }) %>%
+    mean()
+}
+
mod_rmse(mod_formulas[[1]], pauperism_10folds)
+
## [1] 24.23181
+

Finally, we want to run mod_rmse for each formula in mod_formulas. It will be easiest to store this in a data frame:

+
cv_results <- tibble(
+  model_formula = mod_formulas,
+  .id = names(mod_formulas),
+  # Formula as a string
+  .name = map(model_formula,
+              function(x) gsub(" +", " ", paste0(deparse(x), collapse = "")))
+)
+

Use map to run mod_rmse for each model and save it as a list frame in the data frame,

+
cv_results <-
+  mutate(cv_results,
+         cv10_rmse = map(model_formula, mod_rmse, data = pauperism_10folds))
+

In the case of linear regression, the MSE of the Leave-one-out (\(n\)-fold) cross-validation can be analytically calculated without having to run \(n\) regressions.

+
loocv <- function(x) {
+  mean((residuals(x) / (1 - hatvalues(x))) ^ 2)
+}
+
cv_results <- 
+  mutate(cv_results, 
+         rmse_loo = map(mod_formulas, function(f) sqrt(loocv(lm(f, data = pauperism_nonmiss)))))
+
    +
  1. In the 10-fold cross validation, which model has the best out of sample prediction?
  2. +
+
for (i in seq_len(nrow(cv_results))) {
+  print (cv_results$cv10_rmse[i])
+}
+
## $m0
+## [1] 24.23181
+## 
+## $m1
+## [1] 20.98203
+## 
+## $m2
+## [1] 19.05795
+## 
+## $m3
+## [1] 18.38304
+## 
+## $m4
+## [1] 18.20314
+## 
+## $m5
+## [1] 18.25599
+

From this analysis we can see that Model 4 has the best out of sample prediction.

+
    +
  1. Using the LOO-CV cross-validation, which model has the best out of sample prediction?
  2. +
+
for (i in seq_len(nrow(cv_results))) {
+  print (cv_results$rmse_loo[i])
+}
+
## $m0
+## [1] 24.34132
+## 
+## $m1
+## [1] 21.11967
+## 
+## $m2
+## [1] 19.2609
+## 
+## $m3
+## [1] 18.56803
+## 
+## $m4
+## [1] 18.43676
+## 
+## $m5
+## [1] 18.4601
+

Once again, Model 4 has the best out of sample prediction.

+
    +
  1. Does the prediction metric (RMSE) and prediction task—predicting individual PLUs from other PLUs—make sense? Can you think of others that you would prefer?
  2. +
+

The task makes sense. The individual PLUs on average should have similar standard errors. However, as seen in the weighted regression section, smaller PLUs can have more variation, so we should make sure we account for this variation in the PLU size and effect.

+
+
+

Bootstrapping

+

Estimate the 95% confidence intervals of model with simple non-parametric bootstrapped standard errors. The non-parametric bootstrap works as follows:

+

Let \(\hat\theta\) be the estimate of a statistic. To calculate bootstrapped standard errors and confidence intervals use the following procedure.

+

For samples \(b = 1, ..., B\).

+
    +
  1. Draw a sample with replacement from the data
  2. +
  3. Estimate the statistic of interest and call it \(\theta_b^*\).
  4. +
+

Let \(\theta^* = \{\theta_1^*, \dots, \theta_B^*\}\) be the set of bootstrapped statistics.

+ +

Original model

+
mod_formula <- paupratiodiff ~ outratiodiff + (popratiodiff + oldratiodiff) * year * Type
+mod_orig <- lm(mod_formula, data = pauperism_nonmiss)
+
bs_coef_se <-
+  resamplr::bootstrap(pauperism_nonmiss, 1024) %>%
+  # extract the strap column
+  `[[`("sample") %>%
+  # run 
+  map_df(function(dat) {
+    lm(mod_formula, data = dat) %>%
+    broom::tidy() %>%
+    select(term, estimate)
+  }) %>%
+  # calculate 2.5%, 97.5% and sd of estimates
+  group_by(term) %>%
+  summarise(
+    std.error_bs = sd(estimate),
+    conf.low_bsq = quantile(estimate, 0.025),
+    conf.low_bsq = quantile(estimate, 0.975)
+  )
+

Now compare the std.error of the original and the bootstrap for outratiodiff

+
broom::tidy(mod_orig, conf.int = TRUE) %>%
+  select(term, estimate, std.error) %>%
+  filter(term == "outratiodiff") %>%
+  left_join(bs_coef_se, by = "term")
+
##           term  estimate  std.error std.error_bs conf.low_bsq
+## 1 outratiodiff 0.2273004 0.01434334   0.01881032    0.2668222
+

The bootstrap standard error is slightly higher. It is similar to the standard error generated using the heteroskedasticity consistent standard error.

+
sqrt(sandwich::vcovHC(mod_orig)["outratiodiff", "outratiodiff"])
+
## [1] 0.01986397
+

It is likely that there is correlation between the error terms of observations. At the very least, each PLU is included twice; these observations are likely correlated, so we are effectively overstating the sample size of our data. One way to account for that is to resample “PLUs”, not PLU-years. This cluster-bootstrap will resample each PLU (and all its observations), rather than resampling the observations themselves.

+
pauperism_nonmiss %>%
+  group_by(ID) %>%
+  resamplr::bootstrap(1024) %>%
+  # extract the strap column
+  `[[`("sample") %>%
+  # run 
+  map_df(function(dat) {
+    lm(mod_formula, data = dat) %>%
+    broom::tidy() %>%
+    select(term, estimate)
+  }) %>%
+  # calculate 2.5%, 97.5% and sd of estimates
+  group_by(term) %>%
+  summarise(
+    std.error_bs = sd(estimate),
+    conf.low_bsq = quantile(estimate, 0.025),
+    conf.low_bsq = quantile(estimate, 0.975)
+  ) %>%
+  filter(term == "outratiodiff")
+
## # A tibble: 1 × 3
+##           term std.error_bs conf.low_bsq
+##          <chr>        <dbl>        <dbl>
+## 1 outratiodiff   0.02006236    0.2733225
+

However, this yields a standard error not much different than the Robust standard error.

+
    +
  1. Try bootstrapping “Region” and “BoothGroup”. Do either of these make much difference in the standard errors.
  2. +
+
# Region:
+
+pauperism_nonmiss %>%
+  group_by(Region) %>%
+  resamplr::bootstrap(1024) %>%
+  # extract the strap column
+  `[[`("sample") %>%
+  # run 
+  map_df(function(dat) {
+    lm(mod_formula, data = dat) %>%
+    broom::tidy() %>%
+    select(term, estimate)
+  }) %>%
+  # calculate 2.5%, 97.5% and sd of estimates
+  group_by(term) %>%
+  summarise(
+    std.error_bs = sd(estimate),
+    conf.low_bsq = quantile(estimate, 0.025),
+    conf.low_bsq = quantile(estimate, 0.975)
+  ) %>%
+  filter(term == "outratiodiff")
+
## # A tibble: 1 × 3
+##           term std.error_bs conf.low_bsq
+##          <chr>        <dbl>        <dbl>
+## 1 outratiodiff   0.01882124    0.2751381
+
# BoothGroup
+
+pauperism_nonmiss %>%
+  group_by(BoothGroup) %>%
+  resamplr::bootstrap(1024) %>%
+  # extract the strap column
+  `[[`("sample") %>%
+  # run 
+  map_df(function(dat) {
+    lm(mod_formula, data = dat) %>%
+    broom::tidy() %>%
+    select(term, estimate)
+  }) %>%
+  # calculate 2.5%, 97.5% and sd of estimates
+  group_by(term) %>%
+  summarise(
+    std.error_bs = sd(estimate),
+    conf.low_bsq = quantile(estimate, 0.025),
+    conf.low_bsq = quantile(estimate, 0.975)
+  ) %>%
+  filter(term == "outratiodiff")
+
## # A tibble: 1 × 3
+##           term std.error_bs conf.low_bsq
+##          <chr>        <dbl>        <dbl>
+## 1 outratiodiff   0.02148936     0.277693
+

When we run these two calculations we can see that neither of these makes a significantly large difference in the standard errors for Region or BoothGroup, compared to the original bootstrap.

+
+ + + + +
+ + + + + + + +