diff --git a/solutions2.Rmd b/solutions2.Rmd new file mode 100644 index 0000000..8c11b57 --- /dev/null +++ b/solutions2.Rmd @@ -0,0 +1,844 @@ +--- +title: "Solutions Parr" +author: "Christianna Parr" +date: "April 21, 2017" +output: + pdf_document: default + html_document: default +--- + +```{r setup, include=FALSE} +knitr::opts_chunk$set(echo = TRUE) +``` + +### This is the second assignment for POLS 503. The data and paper are by Yule, the original researcher. + +```{r} +library("dplyr") +library("tidyr") +library("tidyverse") +library("modelr") +library("broom") +library("ggplot2") +library ("car") +library ("stats") +library ("texreg") #makes tables +library ("bootstrap") +library ("sandwich") +``` + +```{r} +# devtools::install_github("jrnold/datums") +library(datums) +pauperism <- + left_join(datums::pauperism_plu, datums::pauperism_year, + by = "ID") %>% + mutate (year = as.character(year)) +``` + +### Run regressions of pauper using the yearly level data with the following specifications. + +```{r} + +M1 <- lm(paupratiodiff ~ outratiodiff + year + Type, data = pauperism) + +M2 <- lm(paupratiodiff ~ outratiodiff + (popratiodiff + oldratiodiff) * + (year + Type), data = pauperism) + +M3 <- lm(-1 + paupratiodiff ~ (outratiodiff + popratiodiff + oldratiodiff) * + (year + Type), data = pauperism) + +M4 <- lm(paupratiodiff ~ (outratiodiff + popratiodiff + oldratiodiff) * + (year + Type), data = pauperism) +``` + +### 1. Present the regressions results in a regression table + +```{r, error = TRUE, results = 'as is'} + +table1 <- screenreg(list(M1, M2, M3, M4)) + +table1 + +``` + +### 2. Interpret the coefficients for outratiodiff for each model. + +M1: Holding year and type of area constant, on average when there is an outratio difference of 1 unit, there is a 0.2343 increase in pauperism, ceteris parabis. As the proportion of people recieving outside aid increases, there is an increase in the level of pauperism. + +M2: Holding all else constant, on average when there is an outratio difference of 1 unit, there is a 0.23 increase in pauperism, ceteris parabis. In this model we include oldratiodiff in the controls to make sure we control for the amount of elderly persons recieving out relief payments. + +M3: Holding all else constant, on average when there is an outratio difference of 1 unit, there is a 0.53 increase in pauperism, ceteris parabis. In this regression we are transforming the variable by adding negative 1. We include interactions between outratiodiff and year, Type Mixed, Type Urban, and Type Rural. The coefficient for outratiodiff indicates the increase in pauperism for Metropolitan areas, and the covariance between outratiodiff and PLU type indicates the change on pauperism from Metropolitan areas and the other types of areas. + +M4: Holding all else constant, on average when there is an outratio difference of 1 unit, there is a 0.53 increase in pauperism, ceteris parabis. This model is the same as Model 3 but without the negative 1 transformation. + +### 3. Write the equations for each or all models, and describe the model with a sentence or two. Try to be as concise as possible. Look at recent journal articles for examples of the wording and format. + + +###Correction for writing equations incorrectly: +###M1: +$$\text{paupratiodiff}_{it} = \beta_{1}\text{outratiodiff}_{it} + \alpha_{type} + \rho_{t} + e_{it}$$ +###M2: +$$\text{paupratiodiff}_{it} = \beta_{1}\text{outratiodiff}_{it} + \beta_{2}\text{popratiodiff}_{it} + \beta_{3}\text{oldratiodiff}_{it}+{\bf\it{\Gamma}}\Big ((\text{popratiodiff}_{it} + \text{oldratiodiff}_{it}) \times (\alpha_{type} + \rho_{t})\Big) + \alpha_{type} + \rho_{t} + e_{it}$$ +###M3: +$$\text{paupratiodiff}_{it} = \beta_{1}\text{outratiodiff}_{it} + \beta_{2}\text{popratiodiff}_{it} + \beta_{3}\text{oldratiodiff}_{it} + {\bf\it{\Gamma}}\Big ((\text{outratiodiff}_{it} + \text{popratiodiff}_{it} + \text{oldratiodiff}_{it}) \times (\alpha_{type} + \rho_{t})\Big) + \alpha_{type} + \rho_{t} + e_{it}$$ + + +##Previous Answers: +M1: +$$ +paupratiodiff = \beta_0\ + \beta_1outratiodiff\ + \beta_2year\ + \beta_3type\ + \epsilon\ +$$ +In this first model we are regressing outratiodiff on paupratiodiff and controlling for year and Type. It is a simple linear model without interaction terms. + +M2: +$$ +paupratiodiff = \beta_0\ + \beta_1outratiodiff\ + \beta_2popratiodiff\ + \beta_3oldratiodiff\ + +\beta_4year\ + \beta_5Type\ + \\ \beta_6(popratiodiff \times\ year)\ + \beta_7(popratiodiff \times\ type)\ + \\ \beta_8(oldratiodiff \times\ year)\ + \beta_9(oldratiodiff \times\ type)\ + \epsilon\ +$$ +In this second model we are running another linear regression but we include interaction terms. We can measure the conditional changes when we include year and type with oldratiodiff, popratiodiff and outratiodiff. + +M3: +$$ +paupratiodiff - 1 = \beta_0\ + \beta_1outratiodiff\ + \beta_2popratiodiff\ + \beta_3oldratiodiff\ + +\beta_4year\ + \beta_5Type\ + \\ \beta_6(popratiodiff \times\ year)\ + \beta_7(popratiodiff \times\ type)\ + \\ \beta_8(oldratiodiff \times\ year)\ + \beta_9(oldratiodiff \times\ type)\ \\ +\beta_{10}(outratiodiff \times\ year)\ + \beta{11}(outratiodiff \times\ type)\ + \epsilon\ +$$ +In this third model we are including interaction terms for PLU, year, outratiodiff, popratiodiff and oldratiodiff. We are also including a negative 1 to transform the intercept but should not have any effect on the results. + +M4: +$$ +paupratiodiff = \beta_0\ + \beta_1outratiodiff\ + \beta_2popratiodiff\ + \beta_3oldratiodiff\ + +\beta_4year\ + \beta_5Type\ + \\ \beta_6(popratiodiff \times\ year)\ + \beta_7(popratiodiff \times\ type)\ + \\ \beta_8(oldratiodiff \times\ year)\ + \beta_9(oldratiodiff \times\ type)\ \\ +\beta_{10}(outratiodiff \times\ year)\ + \beta{11}(outratiodiff \times\ type)\ + \epsilon\ +$$ +In this fourth model we are running the same regression as model 3, but without the negative 1. + +### 4. What is the difference between M3 and M4. What are the pros and cons of each parameterization? + +Model 3 has a negative 1 to transform the parameters. The correlation of the data does not change, but it simply shifts the information down by 1 y unit. Maybe this would make more sense to subtract a 100 to see a substantive move. + +## 5. Conduct F-tests on the hypotheses: + +### a) All interactions in M4 are 0 + +```{r} +linearHypothesis(M4, c("popratiodiff:year1891", "popratiodiff:TypeUrban", "popratiodiff:TypeRural", "popratiodiff:TypeMixed", "oldratiodiff:year1891", "oldratiodiff:TypeMixed","oldratiodiff:TypeUrban", "oldratiodiff:TypeRural", "outratiodiff:year1891","outratiodiff:TypeUrban", "outratiodiff:TypeRural", "outratiodiff:TypeMixed")) + +``` + +### b) The coefficients on outratiodiff in M4 are the same across years +```{r} +linearHypothesis(M4, c("outratiodiff:year1891")) +``` + +### c) The coefficients on outratiodiff in M4 are the same across PLU Types +```{r} +linearHypothesis(M4, c("outratiodiff:TypeUrban", "outratiodiff:TypeRural", "outratiodiff:TypeMixed")) +``` + +### d) The coefficients on outratiodiff in M4 are the same across PLU Types and years. +```{r} +linearHypothesis(M4, c("outratiodiff:year1891","outratiodiff:TypeUrban", "outratiodiff:TypeRural", "outratiodiff:TypeMixed")) + +``` + +###6.Calculate the predicted value and confidence interval for the PLU with the median value of outratiodiff, popratiodiff, and oldratiodiff in each year and PLU Type for these models. Plot the predicted value and confidence interval of these as point-ranges. + +```{r} + +data1 <- pauperism %>% + group_by(year, Type) %>% + filter (!is.na(Type), year %in% c("1881", "1891")) %>% + summarise_at (vars(outratiodiff, popratiodiff, oldratiodiff), median, na.rm = TRUE) + +M1_new <- tidy(predict(M1, newdata = data1, interval = "confidence", level = 0.95)) + +dataM1 <- bind_cols (data1, M1_new) + +ggplot (data = dataM1) + + facet_grid(year ~ .) + + geom_pointrange(aes(x = Type, y = fit, ymin = lwr, ymax = upr)) + +M2_new <- tidy(predict(M2, newdata = data1, interval = "confidence", level = 0.95)) + +dataM2 <- bind_cols(data1, M2_new) + +ggplot (data = dataM2) + + facet_grid(year ~ .) + + geom_pointrange(aes(x = Type, y = fit, ymin = lwr, ymax = upr)) + +M3_new <- tidy(predict(M3, newdata = data1, interval = "confidence", level = 0.95)) + +dataM3 <- bind_cols(data1, M3_new) + +ggplot (data = dataM3) + + facet_grid(year ~ .) + + geom_pointrange(aes(x = Type, y = fit, ymin = lwr, ymax = upr)) + +M4_new <- tidy(predict(M4, newdata = data1, interval = "confidence", level = 0.95)) + +dataM4 <- bind_cols (data1, M4_new) + +ggplot (data = dataM4) + + facet_grid(year ~ .) + + geom_pointrange(aes(x = Type, y = fit, ymin = lwr, ymax = upr)) + +``` + +###7.As previously, calculate the predicted value of the median PLU in each year and PLU Type. But instead of confidence intervals include the prediction interval. How do the confidence and prediction intervals differ? What are their definitions? + +```{r} +M1_new <- tidy(predict(M1, newdata = data1, interval = "prediction", level = 0.95)) + +dataM1 <- bind_cols (data1, M1_new) + +ggplot (data = dataM1) + + facet_grid(year ~ .) + + geom_pointrange(aes(x = Type, y = fit, ymin = lwr, ymax = upr)) + +M2_new <- tidy(predict(M2, newdata = data1, interval = "prediction", level = 0.95)) + +dataM2 <- bind_cols(data1, M2_new) + +ggplot (data = dataM2) + + facet_grid(year ~ .) + + geom_pointrange(aes(x = Type, y = fit, ymin = lwr, ymax = upr)) + +M3_new <- tidy(predict(M3, newdata = data1, interval = "prediction", level = 0.95)) + +dataM3 <- bind_cols(data1, M3_new) + +ggplot (data = dataM3) + + facet_grid(year ~ .) + + geom_pointrange(aes(x = Type, y = fit, ymin = lwr, ymax = upr)) + +M4_new <- tidy(predict(M4, newdata = data1, interval = "prediction", level = 0.95)) + +dataM4 <- bind_cols (data1, M4_new) + +ggplot (data = dataM4) + + facet_grid(year ~ .) + + geom_pointrange(aes(x = Type, y = fit, ymin = lwr, ymax = upr)) + +``` + +A prediction interval is an interval of the predicted values, which extimates the true value of a y that has not been observed. It will contain the future values of a y, 95% of the time. A confidence interval contains the parameter estimate around the mean. It will contain the true population mean, 95% of the time. + +The difference between the confidence intervals and the prediction intervals: the prediction intervals are much wider. + +## Functional Forms + +The regression line of the model estimated in @Yule1899a (ignoring the year and region terms and interactions) can be also written as +$$ +\begin{aligned}[t] +100 \times \frac{\mathtt{pauper2}_t / \mathtt{Popn2_t}}{\mathtt{pauper2}_{t-1} / \mathtt{Popn2_{t-1}}} +&= \beta_0 + \beta_1 \times 100 \times \frac{\mathtt{outratio}_t}{\mathtt{outratio_{t-1}}} \\ +& \quad + \beta_2 \times 100 \times \frac{\mathtt{Popn65}_t / \mathtt{Popn2}_{t}}{\mathtt{Popn65}_{t-1} / \mathtt{Popn2}_{t-1}} + \beta_3 \times 100 \times \frac{\mathtt{Popn2}_t}{\mathtt{Popn2}_{t - 1}} +\end{aligned} +$$ + +1. Take the logarithm of each side, and simplify so that $\log(\mathtt{pauper2}_t/\mathtt{pauper2}_{t -1})$ is the outcome and the predictors are all in the form $\log(x_t) - \log(x_{t - 1}) = \log(x_t / x_{t - 1})$ +$$ +\log (\mathtt{pauper2}_t/\mathtt{pauper2}_{t-1}) = \beta_0\ + \beta_1(\log(\mathtt{outratiodiff}))\\ - +\log(\mathtt{outratiodiff_{t-1}})\ + \beta_2(\log(\mathtt{popn65_t}))\\ - \log(\mathtt{popn65_{t-1}})\ + +\log(\mathtt{popn2_t})\ - \log(\mathtt{pop2_{t-1}})\ + \beta_3(\log\mathtt{popn2_t})\ - \log(\mathtt{popn2_{t-1}}) + +$$ + +2. Estimate the model with logged difference predictors, Year, and month and interpret the coefficient on $\log(outratio_t)$ . + +```{r} + +tidy(lm (log(pauper2) ~ (log (outratio) + log(Popn2) + log(Prop65)) * year + Type, data = pauperism)) + +``` + +In this logged model we can see that on average, ceteris paribus, a 1% change in the outratiodiff will have a 0.21% change on pauperism. In order to find $\log(outratio_t)$ we need to look at the covariance between outratio and year. Here we find a 6% increase on the effect of outratio per year, therefore a 0.21 to 0.27 increase. + +3. What are the pros and cons of this parameterization of the model relative to the one in @Yule1899a? Focus on interpretation and the desired goal of the inference rather than the formal tests of the regression. Can you think of other, better functional forms? + +When we log variables they are better to interpret since they are now based on percentage change rather than ratio differences. Over time this becomes easier to interpret as well as less labor intensive. Especially for a project such as Yule's where seeing the percentage changes over time would make more sense than ratio differences. Yule could see the percent change from 1871 to 1881, 1881 to 1891. Another functional form one could use is a weighted regression. We can weight each statistic by the population to control for size. + +## Non-differenced Model + +Suppose you estimate the model (*M5*) without differencing, +```{r} +M5 <- tidy(lm(pauper2 ~ outratio + (Popn2 + Prop65) * (year + Type), data = pauperism)) +tidy(M5) +tidy(M2) +``` + +- Interpret the coefficient on `outratio`. How is this different than model *M2*. + +In the nondifferenced model the coefficient is much smaller (0.0012) than model 2 which is 0.23. The non differenced model has no baseline value unlike outratiodiff (which can use the initial year, 1871, as a comparison point). Therefore, we are unable to see the true effect of outratio on pauperism with a non differenced model since we are comparing outratio and paupratio directly, without seeing the changes over time from 1871 to 1881, 1881 to 1891. Model 5 is a difference-in-differences model while model 2 is a before-and-after design. + +- What accounts for the different in sample sizes in *M5* and *M2*? + +When the difference is not taken into account, you can use the data from 1871. This explains why M5 is has a bigger sample size since the year 1871 is taken into account. + +- What model do you think will generally have less biased estimates of the effect of out-relief on pauperism: *M5* or *M2*? Explain your reasoning. + +M2 will have less biased estimates because of the comparing of different terms. This means changes inside a muncipality are measured with regards to pauperism. Instead of looking at out relief pauperism by itself. With the differing you are comparing from one year to another, you have a baseline each time to compare to. + +## Substantive Effects + +Read @Gross2014a and @McCaskeyRainey2015a. Use the methods described in those papers to assess the substantive effects of out-ratio on the rate of pauperism. Use the model(s) of your choosing. + +McCaskey and Rainy (2015): Suggestions for Substantive Researchers +For the researcher making claims of substantive significance, we suggest the following +strategy: +1. Compute 90% confidence intervals around the estimated effects. +2. Interpret each endpoint of the interval. +3. Claim that the effect is substantively meaningful if and only if all effects in +the confidence interval are substantively meaningful. + +```{r} +# understanding the magnitude of the effect + +tstar <- 1.645 + +M4_ci <- tidy(M4) %>% + mutate (upr = (estimate + (std.error * tstar)), lwr = (estimate - (std.error * tstar))) + +tidy(M4_ci) + +``` + +Now that we have found the 90% confidence intervals for the chosen model (M4). We can see that outratiodiff has an influence between 0.43 and 0.63. Therefore, holding all else constant, on average each one unit increase on outratiodiff has an influence in the range of a 0.43 and 0.63 increase on pauperism. This effect is substantively meaningful due to the potential 43% or 63% growth in pauperism. + +## Influential Observations and Outliers + +### Influential Observations for the Regression + +For this use *M2*: + +1. For each observation, calculate and explain the following: + + - hat value (`hatvalues`) + - standardized error (`rstandard`) + - studentized error (`rstudent`) + - Cook's distance (`cooksd`) + +```{r} + +hatvalues <- hatvalues(M2) # Gives hat scores, where res is the result from lm() +hatscore <- hatvalues(M2)/mean(hatvalues(M2)) # Gives standardized hat scores +head (hatscore) + +# standardized error (rstandard) +rsta <- rstandard(M2) #Gives standardized residuals +head (rsta) + +# studentized error (rstudent) +rstu <- rstudent(M2) #Gives studentized residuals +head (rstu) + +# Cook’s distance (cooksd) +cooks <- cooks.distance(M2) #Gives Cook's Distance +head (cooks) + +# alternative: + +M2_aug <- augment(M2) %>% + mutate (.student.resid = .resid / .sigma * sqrt (1 - .hat)) + +glimpse (M2_aug) + +``` + +2. Create an outlier plot and label any outliers. See the example [here](https://jrnold.github.io/intro-methods-notes/outliers.html#iver-and-soskice-data) + +```{r} +plot(hatscore,rstu, xlab="Standardized hat-values", ylab="Studentized Residuals", +main="Influence Plot") + +abline(h=c(-2,2), lty=2) + +abline(v=c(2,3), lty=c(2,3)) + +``` +3. Using the plot and rules of thumb identify outliers and influential observations + +Looking at the plot we can say that any values which are outside of the box drawn from -2 to 2 are outliers. They are both high on leverage and discrepency. The standardized hat values which are more than 2 or 3 (past one or both the vertical lines) would also be outliers. + +## Influential Observations for a Coefficient + +1. Run *M2*, deleting each observation and saving the coefficient for `outratiodirff`. This is a method called the jackknife. You can use a for loop to do this, or you can use the function `jackknife` in the package [resamplr](https://github.com/jrnold/resamplr). + +```{r} + +jackknifeCOEF <- matrix(NA, ncol= 16, nrow= nrow(pauperism)) + +tidy_M2 <- tidy (M2) + +colnames(jackknifeCOEF) <- c (tidy_M2$term) + +for (i in 1:nrow(pauperism)){ +jackknifeCOEF[i,] <- coef((lm(paupratiodiff ~ outratiodiff + (popratiodiff + oldratiodiff) * + (year + Type), data = pauperism[-i,]))) +} + +head(jackknifeCOEF) + +``` + +For which observations is there the largest change in the coefficient on `outratiodiff`? + +1. Which observations have the largest effect on the estimate of `outratiodiff`? +```{r} +summary (jackknifeCOEF) +jackknife1 <- as.data.frame(jackknifeCOEF) +jackknife1$.rownames <- as.character(seq.int(nrow(jackknife1))) +jack <- jackknife1 %>% + select (.rownames, outratiodiff) %>% + filter (outratiodiff > 0.235) %>% + print() + +``` + +After filtering out the observations which are more than 0.235, we can see there are 8 observations that have a large effect on outratiodiff. + +2. How do these observations compare with those that had the largest effect on the overall regression as measured with Cook's distance? + +```{r} + +M2_aug %>% + filter (.cooksd > 0.35) + +``` + +The observation from row 1413 shows up in both the jackknife and cooks distance calculations. However, only 1410 is visible in the cooks distance calculation. + +3. Compare the results of the jackknife to the `dfbeta` statistic for `outratiodiff` + +```{r} +dfbeta <- tidy(dfbetas (M2)) + +dfbeta %>% + filter (abs(outratiodiff) > 2/sqrt (nrow(dfbeta))) %>% + inner_join (jack, by = ".rownames") %>% + print + +``` + +When we compare the dfbeta statistics to the jackknife we see that the exact same rows appear from both. They correspond to each other. + +2.@AronowSamii2015a note that the influence of observations in a regression coefficient is different than the the influence of regression observations in the entire regression. Calculate the observation weights for `outratiodiff`. + +1. Regress `outratiodiff` on the control variables. + +```{r} +M2_out <- lm (outratiodiff ~ (popratiodiff + oldratiodiff) * (year + Type), data = pauperism) + +M2_out + +``` + +2. The weights of the observations are those with the highest squared errors from this regression. Which observations have the highest coefficient values? + +```{r} +summary(resid (M2_out)) +M2_resid <- tidy (resid(M2_out)) %>% + mutate (x = abs(x)) %>% + top_n (x, n = 25) + +head(M2_resid) # printing only 6 rows but there are 25 values. +``` + +With the resid function we can see the top 25 rows with the absolute highest values from the M2_out regression. + +3. How do the observations with the highest regression weights compare with those with the highest changes in the regression coefficient from the jackknife? + +```{r} +M2_resid %>% + rename (.rownames = names) %>% + inner_join (jack, by = ".rownames") + +``` + + + +## Omitted Variable Bias + +An informal way to assess the potential impact of omitted variables on the coeficient of the variable of interest is to coefficient variation when covariates are added as a measure of the potential for omitted variable bias [@Oster2016a]. + +@NunnWantchekon2011a (Table 4) calculate a simple statistic for omitted variable bias in OLS. This statistic "provide[s] a measure to gauge the strength of the likely bias arising from unobservables: how much stronger selection on unobservables, relative to selection on observables, must be to explain away the full estimated effect." + +1. Run a regression without any controls. Denote the coefficient on the variable of interest as $\hat\beta_R$. + +```{r} +beta_hat_R <- tidy (lm(paupratiodiff ~ outratiodiff, data = pauperism)) %>% + filter (term == "outratiodiff") %>% + select (estimate) + +beta_hat_R <- as.numeric(beta_hat_R) +beta_hat_R + +``` + +2. Run a regression with the full set of controls. Denote the coefficient on the variable of interest in this regression as $\hat\beta_F$. + +```{r} +beta_hat_F <- tidy_M2 %>% + filter (term == "outratiodiff") %>% + select (estimate) + +beta_hat_F <- as.numeric (beta_hat_F) +beta_hat_F + +``` + +3. The ratio is $$\hat\beta_F / (\hat\beta_R - \hat\beta_F)$$ + +```{r} +beta_hat = beta_hat_F / (beta_hat_R - beta_hat_F) +beta_hat + +``` + +Calculate this statistic for *M2* and interpret it. + +The ratio is 3.157. According to the Nunn and Wantchekon 2011 article, the selection on unobservables must be more than 3 times stronger than the selection on observables to explain away the entire effect (Pg. 3238). + +## Heteroskedasticity + +1. Run *M2* and *M3* with a heteroskedasticity consistent (HAC), also called robust, standard error. How does this affect the standard errors on `outratio` coefficients? Use the **sandwich** package to add HAC standard errors [@Zeileis2004a].Compare SE to the HAC. + +```{r} + +HAC_M2 <- vcovHAC(M2) #Gives the estimated HAC covariance matrix +diag_M2 <- sqrt(diag(vcovHAC(M2))) #Gives the HAC standard errors + +diag_M2[2] - tidy_M2$std.error[2] + +HAC_M3 <- vcovHAC(M3) +diag_M3 <- sqrt(diag(vcovHAC(M3))) +tidy_M3 <- tidy (M3) + +diag_M3[2] - tidy_M3$std.error[2] + +``` + +Using the HAC calculation we can see an increase in the standard error from the original M2 regression by around 0.005. + +Using the HAC calculation we can see an increase in the standard error from the original M3 regression by around 0.0132. + +### Multiple Regressions + +1. Run the model with interactions for all years and types +```{r} +tidy(lm(pauper2 ~ (outratio + Popn2 + Prop65) * year * Type - 1, data = pauperism)) +``` + + +2. For each subset of year and type run the regression +```{r} +lm(pauper2 ~ outratio + Popn2 + Prop65, data = pauperism) +``` + +3. Compare the coefficients, standard errors, and regression standard errors in these regresions. + +```{r} +all_interact <- + crossing(Type = pauperism$Type, year = c(1881, 1891)) %>% + mutate(mod = map2(year, Type, + function(yr, ty) { + lm(paupratiodiff ~ outratiodiff + popratiodiff + oldratiodiff, + data = filter(pauperism, + year == yr, + Type == ty)) + })) %>% + mutate(mod_glance = map(mod, broom::glance), + mod_tidy = map(mod, broom::tidy)) + +all_interact %>% + mutate (sigma = map_dbl(mod_glance, function (x) x$sigma)) %>% + mutate (std.error.out = map_dbl(mod_tidy, function (x) x$std.error[2])) %>% + mutate (estimate.out = map_dbl(mod_tidy, function (x) x$estimate[2])) %>% + select (year, Type, sigma, std.error.out, estimate.out) + +``` + +## Weighted Regression + +1. Run *M2* and *M3* as weighted regressions, weighted by the population (`Popn`) and interpret the coefficients on `outratiodiff` and interactions. Informally assess the extent to which the coefficients are different. Which one does it seem to affect more? +```{r} + +M2_weights <- lm(paupratiodiff ~ outratiodiff + (popratiodiff + oldratiodiff) * + (year + Type), weights = Popn, data = pauperism) + +head(tidy(M2_weights)) + +M3_weights <- lm(-1 + paupratiodiff ~ (outratiodiff + popratiodiff + oldratiodiff) * + (year + Type), weights = Popn, data = pauperism) + +head(tidy(M3_weights)) + +``` + +Holding all else constant and with weights, on average in M2 a one unit change in outratiodiff has a 0.36 increase in paupratiodiff. Holding all else constant and with weights, on average in M3 a one unit change in outratiodiff has a 0.72 increase in paupratiodiff. The difference between the two is quite stark. Both M2 and M3 increase when compared to their original coefficients, without weights. However, between the two M3 is effected more (from 0.53 to 0.72). + +2. What are some rationales for weighting by population? See the discussion in @SolonHaiderWooldridge2013a and @AngristPischke2014a. + +According to "Mastering Metrics" weighting weights each term in the residual sum of squares by population or some other weight. We are able to take into account the population of people in each district and make sure the larger districts count for more. In this way we are able to account for small districts having more variability on average. + + +## Cross-Validation + +When using regression for causal inference, model specification and choice should largely be based on avoiding omitted variables. +Another criteria for selecting models is to use their fit to the data. +But a model's fit to data should not be assessed using only the in-sample data. +That leads to overfitting---and the best model would always be to include an indicator variable for every observation +Instead, a model's fit to data can be assessed by using its out-of-sample fit. +One way to estimate the *expected* fit of a model to *new* data is cross-validation. + +We want to compare the predictive performance of the following models +```{r} +mod_formulas <- + list( + m0 = paupratiodiff ~ 1, + m1 = paupratiodiff ~ year + Type, + m2 = paupratiodiff ~ outratiodiff + year + Type, + m3 = paupratiodiff ~ outratiodiff + (popratiodiff + oldratiodiff) * (year + Type), + m4 = -1 + paupratiodiff ~ (outratiodiff + popratiodiff + oldratiodiff) * (year + Type), + m5 = paupratiodiff ~ (outratiodiff + popratiodiff + oldratiodiff) * year * Type + ) +``` + +Let's split the data into 10 (train/test) folds for cross-validation, +```{r} +pauperism_nonmiss <- + pauperism %>% + filter(year %in% c(1881, 1891)) %>% + select(paupratiodiff, outratiodiff, popratiodiff, oldratiodiff, year, Type, Region, ID, BoothGroup) %>% + tidyr::drop_na() +pauperism_10folds <- + pauperism_nonmiss %>% + resamplr::crossv_kfold(10) +``` + + +For each model formula `f`, training data set `train`, and test data set, `test`, +run the model specified by `f` on `train`, and predict new observations in `test`, and calculate the RMSE from the residuals +```{r} +mod_rmse_fold <- function(f, train, test) { + fit <- lm(f, data = as.data.frame(train)) + test_data <- as.data.frame(test) + err <- test_data$paupratiodiff - predict(fit, newdata = test_data) + sqrt(mean(err ^ 2)) +} +``` +E.g. for one fold and formula, +```{r} +mod_rmse_fold(mod_formulas[[1]], pauperism_10folds$train[[1]], + pauperism_10folds$test[[1]]) +``` + +Now write a function that will calculate the average RMSE across folds for a formula and a cross-validation data frame with `train` and `test` list-columns: +```{r} +mod_rmse <- function(f, data) { + map2_dbl(data$train, data$test, + function(train, test) { + mod_rmse_fold(f, train, test) + }) %>% + mean() +} +``` +```{r} +mod_rmse(mod_formulas[[1]], pauperism_10folds) +``` + +Finally, we want to run `mod_rmse` for each formula in `mod_formulas`. +It will be easiest to store this in a data frame: +```{r} +cv_results <- tibble( + model_formula = mod_formulas, + .id = names(mod_formulas), + # Formula as a string + .name = map(model_formula, + function(x) gsub(" +", " ", paste0(deparse(x), collapse = ""))) +) +``` +Use `map` to run `mod_rmse` for each model and save it as a list frame in +the data frame, +```{r} +cv_results <- + mutate(cv_results, + cv10_rmse = map(model_formula, mod_rmse, data = pauperism_10folds)) +``` + +In the case of linear regression, the MSE of the Leave-one-out ($n$-fold) cross-validation can be analytically calculated without having to run $n$ regressions. +```{r} +loocv <- function(x) { + mean((residuals(x) / (1 - hatvalues(x))) ^ 2) +} +``` + +```{r} +cv_results <- + mutate(cv_results, + rmse_loo = map(mod_formulas, function(f) sqrt(loocv(lm(f, data = pauperism_nonmiss))))) +``` + + +1. In the 10-fold cross validation, which model has the best out of sample prediction? +```{r} + +for (i in seq_len(nrow(cv_results))) { + print (cv_results$cv10_rmse[i]) +} + +``` + +From this analysis we can see that Model 4 has the best out of sample prediction. + +2. Using the LOO-CV cross-validation, which model has the best out of sample prediction? + +```{r} +for (i in seq_len(nrow(cv_results))) { + print (cv_results$rmse_loo[i]) +} +``` + +Once again, Model 4 has the best out of sample prediction. + +3. Does the prediction metric (RMSE) and prediction task---predicting individual PLUs from other PLUs---make sense? Can you think of others that you would prefer? + +The task makes sense. The individual PLUs on average should have similar standard errors. However, as seen in the weighted regression section, smaller PLUs can have more variation, so we should make sure we account for this variation in the PLU size and effect. + +## Bootstrapping + +Estimate the 95% confidence intervals of model with simple non-parametric bootstrapped standard errors. The non-parametric bootstrap works as follows: + +Let $\hat\theta$ be the estimate of a statistic. To calculate bootstrapped standard errors and confidence intervals use the following procedure. + +For samples $b = 1, ..., B$. + +1. Draw a sample with replacement from the data +2. Estimate the statistic of interest and call it $\theta_b^*$. + +Let $\theta^* = \{\theta_1^*, \dots, \theta_B^*\}$ be the set of bootstrapped statistics. + +- standard error: $\hat\theta$ is $\sd(\theta^*)$. +- confidence interval: + + - normal approximation. This calculates the confidence interval as usual but uses the bootstrapped standard error instead of the classical OLS standard error: $\hat\theta \pm t_{\alpha/2,df} \cdot \sd(\theta^*)$ + - quantiles: A 95% confidence interval uses the 2.5% and 97.5% quantiles of $\theta^*$ for its upper and lower bounds. + + +Original model +```{r} +mod_formula <- paupratiodiff ~ outratiodiff + (popratiodiff + oldratiodiff) * year * Type +mod_orig <- lm(mod_formula, data = pauperism_nonmiss) +``` + +```{r} +bs_coef_se <- + resamplr::bootstrap(pauperism_nonmiss, 1024) %>% + # extract the strap column + `[[`("sample") %>% + # run + map_df(function(dat) { + lm(mod_formula, data = dat) %>% + broom::tidy() %>% + select(term, estimate) + }) %>% + # calculate 2.5%, 97.5% and sd of estimates + group_by(term) %>% + summarise( + std.error_bs = sd(estimate), + conf.low_bsq = quantile(estimate, 0.025), + conf.low_bsq = quantile(estimate, 0.975) + ) +``` + +Now compare the std.error of the original and the bootstrap for `outratiodiff` +```{r} +broom::tidy(mod_orig, conf.int = TRUE) %>% + select(term, estimate, std.error) %>% + filter(term == "outratiodiff") %>% + left_join(bs_coef_se, by = "term") +``` +The bootstrap standard error is slightly higher. +It is similar to the standard error generated using the heteroskedasticity consistent standard error. +```{r} +sqrt(sandwich::vcovHC(mod_orig)["outratiodiff", "outratiodiff"]) +``` + +It is likely that there is correlation between the error terms of observations. +At the very least, each PLU is included twice; these observations are likely +correlated, so we are effectively overstating the sample size of our data. +One way to account for that is to resample "PLUs", not PLU-years. +This cluster-bootstrap will resample each PLU (and all its observations), rather than resampling the observations themselves. +```{r} +pauperism_nonmiss %>% + group_by(ID) %>% + resamplr::bootstrap(1024) %>% + # extract the strap column + `[[`("sample") %>% + # run + map_df(function(dat) { + lm(mod_formula, data = dat) %>% + broom::tidy() %>% + select(term, estimate) + }) %>% + # calculate 2.5%, 97.5% and sd of estimates + group_by(term) %>% + summarise( + std.error_bs = sd(estimate), + conf.low_bsq = quantile(estimate, 0.025), + conf.low_bsq = quantile(estimate, 0.975) + ) %>% + filter(term == "outratiodiff") +``` +However, this yields a standard error not much different than the Robust standard error. + +1. Try bootstrapping "Region" and "BoothGroup". Do either of these make much difference in the standard errors. + +```{r} + +# Region: + +pauperism_nonmiss %>% + group_by(Region) %>% + resamplr::bootstrap(1024) %>% + # extract the strap column + `[[`("sample") %>% + # run + map_df(function(dat) { + lm(mod_formula, data = dat) %>% + broom::tidy() %>% + select(term, estimate) + }) %>% + # calculate 2.5%, 97.5% and sd of estimates + group_by(term) %>% + summarise( + std.error_bs = sd(estimate), + conf.low_bsq = quantile(estimate, 0.025), + conf.low_bsq = quantile(estimate, 0.975) + ) %>% + filter(term == "outratiodiff") + +# BoothGroup + +pauperism_nonmiss %>% + group_by(BoothGroup) %>% + resamplr::bootstrap(1024) %>% + # extract the strap column + `[[`("sample") %>% + # run + map_df(function(dat) { + lm(mod_formula, data = dat) %>% + broom::tidy() %>% + select(term, estimate) + }) %>% + # calculate 2.5%, 97.5% and sd of estimates + group_by(term) %>% + summarise( + std.error_bs = sd(estimate), + conf.low_bsq = quantile(estimate, 0.025), + conf.low_bsq = quantile(estimate, 0.975) + ) %>% + filter(term == "outratiodiff") +``` + +When we run these two calculations we can see that neither of these makes a significantly large difference in the standard errors for Region or BoothGroup, compared to the original bootstrap. diff --git a/solutions2.html b/solutions2.html new file mode 100644 index 0000000..b59b7b9 --- /dev/null +++ b/solutions2.html @@ -0,0 +1,1405 @@ + + + + +
+ + + + + + + + + + +library("dplyr")
+##
+## Attaching package: 'dplyr'
+## The following objects are masked from 'package:stats':
+##
+## filter, lag
+## The following objects are masked from 'package:base':
+##
+## intersect, setdiff, setequal, union
+library("tidyr")
+library("tidyverse")
+## Loading tidyverse: ggplot2
+## Loading tidyverse: tibble
+## Loading tidyverse: readr
+## Loading tidyverse: purrr
+## Conflicts with tidy packages ----------------------------------------------
+## filter(): dplyr, stats
+## lag(): dplyr, stats
+library("modelr")
+library("broom")
+##
+## Attaching package: 'broom'
+## The following object is masked from 'package:modelr':
+##
+## bootstrap
+library("ggplot2")
+library ("car")
+## Warning: package 'car' was built under R version 3.3.3
+##
+## Attaching package: 'car'
+## The following object is masked from 'package:purrr':
+##
+## some
+## The following object is masked from 'package:dplyr':
+##
+## recode
+library ("stats")
+library ("texreg") #makes tables
+## Warning: package 'texreg' was built under R version 3.3.3
+## Version: 1.36.23
+## Date: 2017-03-03
+## Author: Philip Leifeld (University of Glasgow)
+##
+## Please cite the JSS article in your publications -- see citation("texreg").
+##
+## Attaching package: 'texreg'
+## The following object is masked from 'package:tidyr':
+##
+## extract
+library ("bootstrap")
+##
+## Attaching package: 'bootstrap'
+## The following object is masked from 'package:broom':
+##
+## bootstrap
+## The following object is masked from 'package:modelr':
+##
+## bootstrap
+library ("sandwich")
+## Warning: package 'sandwich' was built under R version 3.3.3
+# devtools::install_github("jrnold/datums")
+library(datums)
+pauperism <-
+ left_join(datums::pauperism_plu, datums::pauperism_year,
+ by = "ID") %>%
+ mutate (year = as.character(year))
+M1 <- lm(paupratiodiff ~ outratiodiff + year + Type, data = pauperism)
+
+M2 <- lm(paupratiodiff ~ outratiodiff + (popratiodiff + oldratiodiff) *
+ (year + Type), data = pauperism)
+
+M3 <- lm(-1 + paupratiodiff ~ (outratiodiff + popratiodiff + oldratiodiff) *
+ (year + Type), data = pauperism)
+
+M4 <- lm(paupratiodiff ~ (outratiodiff + popratiodiff + oldratiodiff) *
+ (year + Type), data = pauperism)
+table1 <- screenreg(list(M1, M2, M3, M4))
+
+table1
+##
+## ==========================================================================
+## Model 1 Model 2 Model 3 Model 4
+## --------------------------------------------------------------------------
+## (Intercept) 56.71 *** -55.34 6.17 7.17
+## (2.52) (35.37) (37.86) (37.86)
+## outratiodiff 0.23 *** 0.23 *** 0.53 *** 0.53 ***
+## (0.02) (0.01) (0.06) (0.06)
+## year1891 14.70 *** 15.81 28.05 28.05
+## (1.28) (16.48) (16.48) (16.48)
+## TypeMixed -6.45 * 64.41 -5.58 -5.58
+## (2.61) (37.22) (39.90) (39.90)
+## TypeRural -7.32 ** 31.75 -39.61 -39.61
+## (2.61) (41.72) (44.04) (44.04)
+## TypeUrban -3.02 109.08 ** 45.07 45.07
+## (2.77) (37.55) (40.24) (40.24)
+## popratiodiff -0.15 -0.19 * -0.19 *
+## (0.09) (0.09) (0.09)
+## oldratiodiff 1.27 *** 0.55 0.55
+## (0.29) (0.33) (0.33)
+## popratiodiff:year1891 -0.24 *** -0.26 *** -0.26 ***
+## (0.06) (0.06) (0.06)
+## popratiodiff:TypeMixed 0.25 * 0.31 ** 0.31 **
+## (0.11) (0.11) (0.11)
+## popratiodiff:TypeRural 0.43 * 0.50 ** 0.50 **
+## (0.17) (0.17) (0.17)
+## popratiodiff:TypeUrban 0.05 0.09 0.09
+## (0.10) (0.10) (0.10)
+## oldratiodiff:year1891 0.20 0.18 0.18
+## (0.13) (0.13) (0.13)
+## oldratiodiff:TypeMixed -0.96 ** -0.20 -0.20
+## (0.30) (0.34) (0.34)
+## oldratiodiff:TypeRural -0.84 ** -0.09 -0.09
+## (0.31) (0.35) (0.35)
+## oldratiodiff:TypeUrban -1.12 *** -0.50 -0.50
+## (0.32) (0.36) (0.36)
+## outratiodiff:year1891 -0.09 ** -0.09 **
+## (0.03) (0.03)
+## outratiodiff:TypeMixed -0.27 *** -0.27 ***
+## (0.06) (0.06)
+## outratiodiff:TypeRural -0.25 *** -0.25 ***
+## (0.06) (0.06)
+## outratiodiff:TypeUrban -0.14 * -0.14 *
+## (0.07) (0.07)
+## --------------------------------------------------------------------------
+## R^2 0.38 0.45 0.46 0.46
+## Adj. R^2 0.38 0.44 0.46 0.46
+## Num. obs. 1180 1180 1180 1180
+## RMSE 19.16 18.18 17.93 17.93
+## ==========================================================================
+## *** p < 0.001, ** p < 0.01, * p < 0.05
+M1: Holding year and type of area constant, on average when there is an outratio difference of 1 unit, there is a 0.2343 increase in pauperism, ceteris parabis. As the proportion of people recieving outside aid increases, there is an increase in the level of pauperism.
+M2: Holding all else constant, on average when there is an outratio difference of 1 unit, there is a 0.23 increase in pauperism, ceteris parabis. In this model we include oldratiodiff in the controls to make sure we control for the amount of elderly persons recieving out relief payments.
+M3: Holding all else constant, on average when there is an outratio difference of 1 unit, there is a 0.53 increase in pauperism, ceteris parabis. In this regression we are transforming the variable by adding negative 1. We include interactions between outratiodiff and year, Type Mixed, Type Urban, and Type Rural. The coefficient for outratiodiff indicates the increase in pauperism for Metropolitan areas, and the covariance between outratiodiff and PLU type indicates the change on pauperism from Metropolitan areas and the other types of areas.
+M4: Holding all else constant, on average when there is an outratio difference of 1 unit, there is a 0.53 increase in pauperism, ceteris parabis. This model is the same as Model 3 but without the negative 1 transformation.
+M1: \[ +paupratiodiff = \beta_0\ + \beta_1outratiodiff\ + \beta_2year\ + \beta_3type\ + \epsilon\ +\] In this first model we are regressing outratiodiff on paupratiodiff and controlling for year and Type. It is a simple linear model without interaction terms.
+M2: \[ +paupratiodiff = \beta_0\ + \beta_1outratiodiff\ + \beta_2popratiodiff\ + \beta_3oldratiodiff\ + +\beta_4year\ + \beta_5Type\ + \\ \beta_6(popratiodiff \times\ year)\ + \beta_7(popratiodiff \times\ type)\ + \\ \beta_8(oldratiodiff \times\ year)\ + \beta_9(oldratiodiff \times\ type)\ + \epsilon\ +\] In this second model we are running another linear regression but we include interaction terms. We can measure the conditional changes when we include year and type with oldratiodiff, popratiodiff and outratiodiff.
+M3: \[ +paupratiodiff - 1 = \beta_0\ + \beta_1outratiodiff\ + \beta_2popratiodiff\ + \beta_3oldratiodiff\ + +\beta_4year\ + \beta_5Type\ + \\ \beta_6(popratiodiff \times\ year)\ + \beta_7(popratiodiff \times\ type)\ + \\ \beta_8(oldratiodiff \times\ year)\ + \beta_9(oldratiodiff \times\ type)\ \\ +\beta_{10}(outratiodiff \times\ year)\ + \beta{11}(outratiodiff \times\ type)\ + \epsilon\ +\] In this third model we are including interaction terms for PLU, year, outratiodiff, popratiodiff and oldratiodiff. We are also including a negative 1 to transform the intercept but should not have any effect on the results.
+M4: \[ +paupratiodiff = \beta_0\ + \beta_1outratiodiff\ + \beta_2popratiodiff\ + \beta_3oldratiodiff\ + +\beta_4year\ + \beta_5Type\ + \\ \beta_6(popratiodiff \times\ year)\ + \beta_7(popratiodiff \times\ type)\ + \\ \beta_8(oldratiodiff \times\ year)\ + \beta_9(oldratiodiff \times\ type)\ \\ +\beta_{10}(outratiodiff \times\ year)\ + \beta{11}(outratiodiff \times\ type)\ + \epsilon\ +\] In this fourth model we are running the same regression as model 3, but without the negative 1.
+Model 3 has a negative 1 to transform the parameters. The correlation of the data does not change, but it simply shifts the information down by 1 y unit. Maybe this would make more sense to subtract a 100 to see a substantive move.
+linearHypothesis(M4, c("popratiodiff:year1891", "popratiodiff:TypeUrban", "popratiodiff:TypeRural", "popratiodiff:TypeMixed", "oldratiodiff:year1891", "oldratiodiff:TypeMixed","oldratiodiff:TypeUrban", "oldratiodiff:TypeRural", "outratiodiff:year1891","outratiodiff:TypeUrban", "outratiodiff:TypeRural", "outratiodiff:TypeMixed"))
+## Linear hypothesis test
+##
+## Hypothesis:
+## popratiodiff:year1891 = 0
+## popratiodiff:TypeUrban = 0
+## popratiodiff:TypeRural = 0
+## popratiodiff:TypeMixed = 0
+## oldratiodiff:year1891 = 0
+## oldratiodiff:TypeMixed = 0
+## oldratiodiff:TypeUrban = 0
+## oldratiodiff:TypeRural = 0
+## outratiodiff:year1891 = 0
+## outratiodiff:TypeUrban = 0
+## outratiodiff:TypeRural = 0
+## outratiodiff:TypeMixed = 0
+##
+## Model 1: restricted model
+## Model 2: paupratiodiff ~ (outratiodiff + popratiodiff + oldratiodiff) *
+## (year + Type)
+##
+## Res.Df RSS Df Sum of Sq F Pr(>F)
+## 1 1172 402371
+## 2 1160 373054 12 29318 7.5969 1.134e-13 ***
+## ---
+## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
+linearHypothesis(M4, c("outratiodiff:year1891"))
+## Linear hypothesis test
+##
+## Hypothesis:
+## outratiodiff:year1891 = 0
+##
+## Model 1: restricted model
+## Model 2: paupratiodiff ~ (outratiodiff + popratiodiff + oldratiodiff) *
+## (year + Type)
+##
+## Res.Df RSS Df Sum of Sq F Pr(>F)
+## 1 1161 375552
+## 2 1160 373054 1 2498.8 7.7698 0.005399 **
+## ---
+## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
+linearHypothesis(M4, c("outratiodiff:TypeUrban", "outratiodiff:TypeRural", "outratiodiff:TypeMixed"))
+## Linear hypothesis test
+##
+## Hypothesis:
+## outratiodiff:TypeUrban = 0
+## outratiodiff:TypeRural = 0
+## outratiodiff:TypeMixed = 0
+##
+## Model 1: restricted model
+## Model 2: paupratiodiff ~ (outratiodiff + popratiodiff + oldratiodiff) *
+## (year + Type)
+##
+## Res.Df RSS Df Sum of Sq F Pr(>F)
+## 1 1163 380862
+## 2 1160 373054 3 7808.7 8.0937 2.45e-05 ***
+## ---
+## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
+linearHypothesis(M4, c("outratiodiff:year1891","outratiodiff:TypeUrban", "outratiodiff:TypeRural", "outratiodiff:TypeMixed"))
+## Linear hypothesis test
+##
+## Hypothesis:
+## outratiodiff:year1891 = 0
+## outratiodiff:TypeUrban = 0
+## outratiodiff:TypeRural = 0
+## outratiodiff:TypeMixed = 0
+##
+## Model 1: restricted model
+## Model 2: paupratiodiff ~ (outratiodiff + popratiodiff + oldratiodiff) *
+## (year + Type)
+##
+## Res.Df RSS Df Sum of Sq F Pr(>F)
+## 1 1164 384696
+## 2 1160 373054 4 11642 9.0502 3.37e-07 ***
+## ---
+## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
+data1 <- pauperism %>%
+ group_by(year, Type) %>%
+ filter (!is.na(Type), year %in% c("1881", "1891")) %>%
+ summarise_at (vars(outratiodiff, popratiodiff, oldratiodiff), median, na.rm = TRUE)
+
+M1_new <- tidy(predict(M1, newdata = data1, interval = "confidence", level = 0.95))
+
+dataM1 <- bind_cols (data1, M1_new)
+
+ggplot (data = dataM1) +
+ facet_grid(year ~ .) +
+ geom_pointrange(aes(x = Type, y = fit, ymin = lwr, ymax = upr))
+
+M2_new <- tidy(predict(M2, newdata = data1, interval = "confidence", level = 0.95))
+
+dataM2 <- bind_cols(data1, M2_new)
+
+ggplot (data = dataM2) +
+ facet_grid(year ~ .) +
+ geom_pointrange(aes(x = Type, y = fit, ymin = lwr, ymax = upr))
+
+M3_new <- tidy(predict(M3, newdata = data1, interval = "confidence", level = 0.95))
+
+dataM3 <- bind_cols(data1, M3_new)
+
+ggplot (data = dataM3) +
+ facet_grid(year ~ .) +
+ geom_pointrange(aes(x = Type, y = fit, ymin = lwr, ymax = upr))
+
+M4_new <- tidy(predict(M4, newdata = data1, interval = "confidence", level = 0.95))
+
+dataM4 <- bind_cols (data1, M4_new)
+
+ggplot (data = dataM4) +
+ facet_grid(year ~ .) +
+ geom_pointrange(aes(x = Type, y = fit, ymin = lwr, ymax = upr))
+
+M1_new <- tidy(predict(M1, newdata = data1, interval = "prediction", level = 0.95))
+
+dataM1 <- bind_cols (data1, M1_new)
+
+ggplot (data = dataM1) +
+ facet_grid(year ~ .) +
+ geom_pointrange(aes(x = Type, y = fit, ymin = lwr, ymax = upr))
+
+M2_new <- tidy(predict(M2, newdata = data1, interval = "prediction", level = 0.95))
+
+dataM2 <- bind_cols(data1, M2_new)
+
+ggplot (data = dataM2) +
+ facet_grid(year ~ .) +
+ geom_pointrange(aes(x = Type, y = fit, ymin = lwr, ymax = upr))
+
+M3_new <- tidy(predict(M3, newdata = data1, interval = "prediction", level = 0.95))
+
+dataM3 <- bind_cols(data1, M3_new)
+
+ggplot (data = dataM3) +
+ facet_grid(year ~ .) +
+ geom_pointrange(aes(x = Type, y = fit, ymin = lwr, ymax = upr))
+
+M4_new <- tidy(predict(M4, newdata = data1, interval = "prediction", level = 0.95))
+
+dataM4 <- bind_cols (data1, M4_new)
+
+ggplot (data = dataM4) +
+ facet_grid(year ~ .) +
+ geom_pointrange(aes(x = Type, y = fit, ymin = lwr, ymax = upr))
+
+A prediction interval is an interval of the predicted values, which extimates the true value of a y that has not been observed. It will contain the future values of a y, 95% of the time. A confidence interval contains the parameter estimate around the mean. It will contain the true population mean, 95% of the time.
+The difference between the confidence intervals and the prediction intervals: the prediction intervals are much wider.
+The regression line of the model estimated in @Yule1899a (ignoring the year and region terms and interactions) can be also written as \[ +\begin{aligned}[t] +100 \times \frac{\mathtt{pauper2}_t / \mathtt{Popn2_t}}{\mathtt{pauper2}_{t-1} / \mathtt{Popn2_{t-1}}} +&= \beta_0 + \beta_1 \times 100 \times \frac{\mathtt{outratio}_t}{\mathtt{outratio_{t-1}}} \\ +& \quad + \beta_2 \times 100 \times \frac{\mathtt{Popn65}_t / \mathtt{Popn2}_{t}}{\mathtt{Popn65}_{t-1} / \mathtt{Popn2}_{t-1}} + \beta_3 \times 100 \times \frac{\mathtt{Popn2}_t}{\mathtt{Popn2}_{t - 1}} +\end{aligned} +\]
+$$
+tidy(lm (log(pauper2) ~ (log (outratio) + log(Popn2) + log(Prop65)) * year + Type, data = pauperism))
+## term estimate std.error statistic p.value
+## 1 (Intercept) 0.532584892 0.25222351 2.1115593 3.486505e-02
+## 2 log(outratio) 0.207982715 0.02860373 7.2711757 5.339890e-13
+## 3 log(Popn2) -0.076986692 0.02492738 -3.0884385 2.043590e-03
+## 4 log(Prop65) 0.898523115 0.06963985 12.9024273 1.861463e-36
+## 5 year1881 -1.172861822 0.26312476 -4.4574361 8.818390e-06
+## 6 year1891 -1.082628536 0.25949596 -4.1720438 3.166064e-05
+## 7 TypeMixed -0.594474654 0.04903643 -12.1231221 1.522963e-32
+## 8 TypeRural -0.694681619 0.05556267 -12.5026692 2.004940e-34
+## 9 TypeUrban -0.460494803 0.04371829 -10.5332288 3.359133e-25
+## 10 log(outratio):year1881 0.061980549 0.03537396 1.7521518 7.992204e-02
+## 11 log(outratio):year1891 0.004959228 0.03410923 0.1453925 8.844176e-01
+## 12 log(Popn2):year1881 0.030172996 0.03313236 0.9106806 3.625885e-01
+## 13 log(Popn2):year1891 0.017453919 0.03271408 0.5335292 5.937347e-01
+## 14 log(Prop65):year1881 -0.150544910 0.09327188 -1.6140439 1.066974e-01
+## 15 log(Prop65):year1891 -0.132271627 0.09508523 -1.3910848 1.643757e-01
+In this logged model we can see that on average, ceteris paribus, a 1% change in the outratiodiff will have a 0.21% change on pauperism. In order to find \(\log(outratio_t)\) we need to look at the covariance between outratio and year. Here we find a 6% increase on the effect of outratio per year, therefore a 0.21 to 0.27 increase.
+When we log variables they are better to interpret since they are now based on percentage change rather than ratio differences. Over time this becomes easier to interpret as well as less labor intensive. Especially for a project such as Yule’s where seeing the percentage changes over time would make more sense than ratio differences. Yule could see the percent change from 1871 to 1881, 1881 to 1891. Another functional form one could use is a weighted regression. We can weight each statistic by the population to control for size.
+Suppose you estimate the model (M5) without differencing,
+M5 <- tidy(lm(pauper2 ~ outratio + (Popn2 + Prop65) * (year + Type), data = pauperism))
+tidy(M5)
+## Warning: NAs introduced by coercion
+## Warning in FUN(newX[, i], ...): no non-missing arguments to min; returning
+## Inf
+## Warning in FUN(newX[, i], ...): no non-missing arguments to max; returning
+## -Inf
+## column n mean sd median trimmed mad
+## 1 term* 19 NaN NA NA NaN NA
+## 2 estimate 19 0.034123738 0.2133329 6.617932e-08 0.03462656 0.03772393
+## 3 std.error 19 0.066373662 0.1100641 1.078133e-02 0.05801804 0.01598437
+## 4 statistic 19 0.002208183 3.8717500 3.118774e-01 -0.25821208 3.51922885
+## 5 p.value 19 0.166782878 0.2412675 2.181532e-02 0.14198258 0.03234340
+## min max range skew kurtosis se
+## 1 Inf -Inf -Inf NA NA NA
+## 2 -4.283603e-01 0.4880598 0.9164200 0.2347724 0.35159431 0.04894192
+## 3 2.231202e-08 0.2747928 0.2747928 1.2238901 -0.41769516 0.02525043
+## 4 -6.944277e+00 11.3758381 18.3201153 0.9026189 1.71319524 0.88824036
+## 5 5.564443e-29 0.7551707 0.7551707 1.1752498 -0.04556811 0.05535055
+tidy(M2)
+## term estimate std.error statistic p.value
+## 1 (Intercept) -55.34359311 35.37428433 -1.5645148 1.179685e-01
+## 2 outratiodiff 0.23257691 0.01437241 16.1821800 2.796381e-53
+## 3 popratiodiff -0.15460467 0.09242455 -1.6727663 9.464203e-02
+## 4 oldratiodiff 1.27422786 0.29222562 4.3604248 1.412925e-05
+## 5 year1891 15.81459463 16.47984545 0.9596325 3.374395e-01
+## 6 TypeMixed 64.41009425 37.21878051 1.7305805 8.379165e-02
+## 7 TypeRural 31.75172916 41.71570256 0.7611457 4.467242e-01
+## 8 TypeUrban 109.07760812 37.55456303 2.9045101 3.747781e-03
+## 9 popratiodiff:year1891 -0.23715938 0.05852434 -4.0523207 5.407671e-05
+## 10 popratiodiff:TypeMixed 0.24809583 0.11338910 2.1880043 2.886757e-02
+## 11 popratiodiff:TypeRural 0.43310791 0.16858281 2.5691108 1.031968e-02
+## 12 popratiodiff:TypeUrban 0.04596014 0.09636220 0.4769520 6.334858e-01
+## 13 oldratiodiff:year1891 0.20294088 0.12744222 1.5924149 1.115630e-01
+## 14 oldratiodiff:TypeMixed -0.95692127 0.29917492 -3.1985344 1.418526e-03
+## 15 oldratiodiff:TypeRural -0.83827902 0.31010130 -2.7032425 6.966637e-03
+## 16 oldratiodiff:TypeUrban -1.11662635 0.31639892 -3.5291725 4.331264e-04
+outratio
. How is this different than model M2.In the nondifferenced model the coefficient is much smaller (0.0012) than model 2 which is 0.23. The non differenced model has no baseline value unlike outratiodiff (which can use the initial year, 1871, as a comparison point). Therefore, we are unable to see the true effect of outratio on pauperism with a non differenced model since we are comparing outratio and paupratio directly, without seeing the changes over time from 1871 to 1881, 1881 to 1891. Model 5 is a difference-in-differences model while model 2 is a before-and-after design.
+When the difference is not taken into account, you can use the data from 1871. This explains why M5 is has a bigger sample size since the year 1871 is taken into account.
+M2 will have less biased estimates because of the comparing of different terms. This means changes inside a muncipality are measured with regards to pauperism. Instead of looking at out relief pauperism by itself. With the differing you are comparing from one year to another, you have a baseline each time to compare to.
+Read @Gross2014a and @McCaskeyRainey2015a. Use the methods described in those papers to assess the substantive effects of out-ratio on the rate of pauperism. Use the model(s) of your choosing.
+McCaskey and Rainy (2015): Suggestions for Substantive Researchers For the researcher making claims of substantive significance, we suggest the following strategy: 1. Compute 90% confidence intervals around the estimated effects. 2. Interpret each endpoint of the interval. 3. Claim that the effect is substantively meaningful if and only if all effects in the confidence interval are substantively meaningful.
+# understanding the magnitude of the effect
+
+tstar <- 1.645
+
+M4_ci <- tidy(M4) %>%
+ mutate (upr = (estimate + (std.error * tstar)), lwr = (estimate - (std.error * tstar)))
+
+tidy(M4_ci)
+## Warning: NAs introduced by coercion
+## Warning in FUN(newX[, i], ...): no non-missing arguments to min; returning
+## Inf
+## Warning in FUN(newX[, i], ...): no non-missing arguments to max; returning
+## -Inf
+## column n mean sd median trimmed mad
+## 1 term* 20 NaN NA NA NaN NA
+## 2 estimate 20 1.76399914 15.2326789 -0.08882782 0.4590258 0.4887553
+## 3 std.error 20 9.04223179 16.5778174 0.14673840 6.0294163 0.1507126
+## 4 statistic 20 -0.08260028 3.0272100 -0.19739114 -0.2793311 2.7507670
+## 5 p.value 20 0.23474321 0.3055838 0.09144136 0.1847654 0.1355041
+## 6 upr 20 16.63847043 31.9529777 0.43500310 9.5235824 0.7031488
+## 7 lwr 20 -13.11047215 30.5030756 -0.34683071 -5.0189513 0.5674414
+## min max range skew kurtosis se
+## 1 Inf -Inf -Inf NA NA NA
+## 2 -3.960588e+01 45.0723637 84.6782429 0.3997320 3.4340538 3.40613055
+## 3 3.178236e-02 44.0449941 44.0132118 1.2779185 -0.2582121 3.70691267
+## 4 -4.502796e+00 8.6129042 13.1156999 0.8410447 1.1131842 0.67690474
+## 5 2.291745e-17 0.8888402 0.8888402 1.0912734 -0.3380284 0.06833062
+## 6 -1.668551e-01 111.2652104 111.4320655 1.6429652 1.5503020 7.14490301
+## 7 -1.120599e+02 0.9371370 112.9970315 -2.1224544 3.4010009 6.82069506
+Now that we have found the 90% confidence intervals for the chosen model (M4). We can see that outratiodiff has an influence between 0.43 and 0.63. Therefore, holding all else constant, on average each one unit increase on outratiodiff has an influence in the range of a 0.43 and 0.63 increase on pauperism. This effect is substantively meaningful due to the potential 43% or 63% growth in pauperism.
+For this use M2:
+hatvalues
)rstandard
)rstudent
)cooksd
)hatvalues <- hatvalues(M2) # Gives hat scores, where res is the result from lm()
+hatscore <- hatvalues(M2)/mean(hatvalues(M2)) # Gives standardized hat scores
+head (hatscore)
+## 2 3 5 6 8 9
+## 2.256707 5.372202 9.182804 6.406825 3.945256 1.418897
+# standardized error (rstandard)
+rsta <- rstandard(M2) #Gives standardized residuals
+head (rsta)
+## 2 3 5 6 8 9
+## -1.70910056 0.28214127 -0.05082361 1.23807840 -1.70039734 -0.26657462
+# studentized error (rstudent)
+rstu <- rstudent(M2) #Gives studentized residuals
+head (rstu)
+## 2 3 5 6 8 9
+## -1.71051384 0.28202969 -0.05080184 1.23836212 -1.70178168 -0.26646822
+# Cook’s distance (cooksd)
+cooks <- cooks.distance(M2) #Gives Cook's Distance
+head (cooks)
+## 2 3 5 6 8
+## 5.762689e-03 3.908863e-04 2.296014e-05 9.114347e-03 1.021342e-02
+## 9
+## 8.712515e-05
+# alternative:
+
+M2_aug <- augment(M2) %>%
+ mutate (.student.resid = .resid / .sigma * sqrt (1 - .hat))
+
+glimpse (M2_aug)
+## Observations: 1,180
+## Variables: 15
+## $ .rownames <chr> "2", "3", "5", "6", "8", "9", "11", "12", "14",...
+## $ paupratiodiff <dbl> 27.02455, 126.21656, 30.55314, 81.81045, 47.132...
+## $ outratiodiff <dbl> 4.709511, 105.770522, 20.691155, 40.996818, 12....
+## $ popratiodiff <dbl> 135.62124, 101.93502, 173.89046, 164.47200, 110...
+## $ oldratiodiff <dbl> 104.24513, 119.24258, 85.41113, 104.74832, 115....
+## $ year <chr> "1881", "1891", "1881", "1891", "1881", "1891",...
+## $ Type <chr> "Metropolitan", "Metropolitan", "Metropolitan",...
+## $ .fitted <dbl> 57.61610, 121.27772, 31.41765, 60.30265, 77.206...
+## $ .se.fit <dbl> 3.180086, 4.906563, 6.414884, 5.358245, 4.20473...
+## $ .resid <dbl> -30.5915473, 4.9388445, -0.8645153, 21.5078065,...
+## $ .hat <dbl> 0.03059942, 0.07284342, 0.12451260, 0.08687220,...
+## $ .sigma <dbl> 18.16449, 18.18670, 18.18730, 18.17535, 18.1647...
+## $ .cooksd <dbl> 5.762689e-03, 3.908863e-04, 2.296014e-05, 9.114...
+## $ .std.resid <dbl> -1.70910056, 0.28214127, -0.05082361, 1.2380784...
+## $ .student.resid <dbl> -1.65817311, 0.26148569, -0.04447637, 1.1307828...
+plot(hatscore,rstu, xlab="Standardized hat-values", ylab="Studentized Residuals",
+main="Influence Plot") +
+abline(h=c(-2,2), lty=2) +
+abline(v=c(2,3), lty=c(2,3))
+
+## numeric(0)
+Looking at the plot we can say that any values which are outside of the box drawn from -2 to 2 are outliers. They are both high on leverage and discrepency. The standardized hat values which are more than 2 or 3 (past one or both the vertical lines) would also be outliers.
+outratiodirff
. This is a method called the jackknife. You can use a for loop to do this, or you can use the function jackknife
in the package resamplr.jackknifeCOEF <- matrix(NA, ncol= 16, nrow= nrow(pauperism))
+
+tidy_M2 <- tidy (M2)
+
+colnames(jackknifeCOEF) <- c (tidy_M2$term)
+
+for (i in 1:nrow(pauperism)){
+jackknifeCOEF[i,] <- coef((lm(paupratiodiff ~ outratiodiff + (popratiodiff + oldratiodiff) *
+ (year + Type), data = pauperism[-i,])))
+}
+
+head(jackknifeCOEF)
+## (Intercept) outratiodiff popratiodiff oldratiodiff year1891 TypeMixed
+## [1,] -55.34359 0.2325769 -0.1546047 1.274228 15.81459 64.41009
+## [2,] -60.03163 0.2317537 -0.1365587 1.305022 17.11302 68.36327
+## [3,] -53.53339 0.2325645 -0.1549528 1.256332 15.94401 62.53052
+## [4,] -55.34359 0.2325769 -0.1546047 1.274228 15.81459 64.41009
+## [5,] -55.15810 0.2325761 -0.1536138 1.271479 15.82996 64.21723
+## [6,] -50.00969 0.2332978 -0.1790220 1.246527 16.61523 58.50446
+## TypeRural TypeUrban popratiodiff:year1891 popratiodiff:TypeMixed
+## [1,] 31.75173 109.0776 -0.2371594 0.2480958
+## [2,] 35.81002 112.8970 -0.2419905 0.2327206
+## [3,] 29.85320 107.2313 -0.2367462 0.2483106
+## [4,] 31.75173 109.0776 -0.2371594 0.2480958
+## [5,] 31.56388 108.8799 -0.2372897 0.2471626
+## [6,] 26.07548 102.8957 -0.2443414 0.2758056
+## popratiodiff:TypeRural popratiodiff:TypeUrban oldratiodiff:year1891
+## [1,] 0.4331079 0.04596014 0.2029409
+## [2,] 0.4175821 0.03022029 0.1951028
+## [3,] 0.4333823 0.04611562 0.2012715
+## [4,] 0.4331079 0.04596014 0.2029409
+## [5,] 0.4321577 0.04503012 0.2029273
+## [6,] 0.4604489 0.07372856 0.2018767
+## oldratiodiff:TypeMixed oldratiodiff:TypeRural oldratiodiff:TypeUrban
+## [1,] -0.9569213 -0.8382790 -1.116626
+## [2,] -0.9824004 -0.8646263 -1.140453
+## [3,] -0.9381813 -0.8194209 -1.098145
+## [4,] -0.9569213 -0.8382790 -1.116626
+## [5,] -0.9541600 -0.8355499 -1.113821
+## [6,] -0.9273552 -0.8106171 -1.084341
+For which observations is there the largest change in the coefficient on outratiodiff
?
outratiodiff
?summary (jackknifeCOEF)
+## (Intercept) outratiodiff popratiodiff oldratiodiff
+## Min. :-77.54 Min. :0.2291 Min. :-0.1930 Min. :1.077
+## 1st Qu.:-55.37 1st Qu.:0.2325 1st Qu.:-0.1547 1st Qu.:1.274
+## Median :-55.34 Median :0.2326 Median :-0.1546 Median :1.274
+## Mean :-55.35 Mean :0.2326 Mean :-0.1546 Mean :1.274
+## 3rd Qu.:-55.32 3rd Qu.:0.2326 3rd Qu.:-0.1545 3rd Qu.:1.274
+## Max. :-31.49 Max. :0.2415 Max. :-0.0951 Max. :1.467
+## year1891 TypeMixed TypeRural TypeUrban
+## Min. : 9.357 Min. :39.23 Min. : 6.742 Min. : 83.81
+## 1st Qu.:15.792 1st Qu.:64.39 1st Qu.:31.718 1st Qu.:109.06
+## Median :15.815 Median :64.41 Median :31.752 Median :109.08
+## Mean :15.818 Mean :64.41 Mean :31.755 Mean :109.09
+## 3rd Qu.:15.859 3rd Qu.:64.43 3rd Qu.:31.777 3rd Qu.:109.09
+## Max. :31.861 Max. :85.56 Max. :53.088 Max. :130.13
+## popratiodiff:year1891 popratiodiff:TypeMixed popratiodiff:TypeRural
+## Min. :-0.3241 Min. :0.1835 Min. :0.3713
+## 1st Qu.:-0.2373 1st Qu.:0.2480 1st Qu.:0.4330
+## Median :-0.2372 Median :0.2481 Median :0.4331
+## Mean :-0.2372 Mean :0.2481 Mean :0.4331
+## 3rd Qu.:-0.2370 3rd Qu.:0.2481 3rd Qu.:0.4332
+## Max. :-0.1647 Max. :0.2901 Max. :0.4746
+## popratiodiff:TypeUrban oldratiodiff:year1891 oldratiodiff:TypeMixed
+## Min. :-0.02761 Min. :0.08927 Min. :-1.1426
+## 1st Qu.: 0.04594 1st Qu.:0.20255 1st Qu.:-0.9571
+## Median : 0.04596 Median :0.20294 Median :-0.9569
+## Mean : 0.04593 Mean :0.20292 Mean :-0.9569
+## 3rd Qu.: 0.04599 3rd Qu.:0.20317 3rd Qu.:-0.9567
+## Max. : 0.10453 Max. :0.25034 Max. :-0.7492
+## oldratiodiff:TypeRural oldratiodiff:TypeUrban
+## Min. :-1.0252 Min. :-1.3010
+## 1st Qu.:-0.8384 1st Qu.:-1.1167
+## Median :-0.8383 Median :-1.1166
+## Mean :-0.8383 Mean :-1.1167
+## 3rd Qu.:-0.8380 3rd Qu.:-1.1165
+## Max. :-0.6318 Max. :-0.9075
+jackknife1 <- as.data.frame(jackknifeCOEF)
+jackknife1$.rownames <- as.character(seq.int(nrow(jackknife1)))
+jack <- jackknife1 %>%
+ select (.rownames, outratiodiff) %>%
+ filter (outratiodiff > 0.235) %>%
+ print()
+## .rownames outratiodiff
+## 1 93 0.2353312
+## 2 242 0.2400089
+## 3 321 0.2414890
+## 4 456 0.2354546
+## 5 1047 0.2357447
+## 6 1134 0.2364103
+## 7 1413 0.2371184
+## 8 1620 0.2389753
+After filtering out the observations which are more than 0.235, we can see there are 8 observations that have a large effect on outratiodiff.
+M2_aug %>%
+ filter (.cooksd > 0.35)
+## .rownames paupratiodiff outratiodiff popratiodiff oldratiodiff year
+## 1 1410 23.39941 83.12100 379.74075 109.7245 1891
+## 2 1413 278.60457 58.87793 35.73522 122.9814 1891
+## Type .fitted .se.fit .resid .hat .sigma .cooksd
+## 1 Urban -2.874848 13.472445 26.27426 0.54919739 18.15109 0.3528030
+## 2 Urban 115.224894 4.993271 163.37967 0.07544072 17.49155 0.4455003
+## .std.resid .student.resid
+## 1 2.152559 0.9718988
+## 2 9.346488 8.9812582
+The observation from row 1413 shows up in both the jackknife and cooks distance calculations. However, only 1410 is visible in the cooks distance calculation.
+dfbeta
statistic for outratiodiff
dfbeta <- tidy(dfbetas (M2))
+
+dfbeta %>%
+ filter (abs(outratiodiff) > 2/sqrt (nrow(dfbeta))) %>%
+ inner_join (jack, by = ".rownames") %>%
+ print
+## .rownames X.Intercept. outratiodiff.x popratiodiff oldratiodiff
+## 1 93 0.56743376 -0.1920594 -0.288357946 -0.58185633
+## 2 242 -0.08516041 -0.5197752 0.048325627 0.09392898
+## 3 321 -0.04612685 -0.6205123 0.025062100 0.05713621
+## 4 456 -0.02003577 -0.2003758 0.008077134 0.02511958
+## 5 1047 -0.02126329 -0.2207569 0.004198310 0.02866121
+## 6 1134 -0.02580265 -0.2672048 0.015376390 0.03076645
+## 7 1413 -0.29264457 -0.3284116 0.371965630 0.22264136
+## 8 1620 -0.04961977 -0.4453968 0.029293890 0.05677375
+## year1891 TypeMixed TypeRural TypeUrban popratiodiff.year1891
+## 1 0.136496813 -0.56384874 -0.501126078 -0.55775059 -0.054926491
+## 2 0.271916012 -0.06403943 0.035639267 0.04696809 -0.031194292
+## 3 -0.003052264 0.04116431 0.055215626 0.05888512 -0.009965884
+## 4 0.023242039 0.01823860 0.035396621 0.01819242 0.004779178
+## 5 0.032838882 0.01669058 0.004075248 0.01848190 0.026178107
+## 6 -0.006831864 0.03484534 0.026993426 0.02696833 -0.023254825
+## 7 0.407249707 0.20134869 0.248061988 -0.17891278 -1.286102999
+## 8 0.086022476 0.04276304 0.093888252 0.04222417 -0.024381113
+## popratiodiff.TypeMixed popratiodiff.TypeRural popratiodiff.TypeUrban
+## 1 0.254384432 0.166506078 0.292960278
+## 2 -0.025457254 -0.021688023 -0.035494141
+## 3 0.014026729 -0.011969316 -0.018171955
+## 4 -0.002729445 -0.021453301 -0.008013993
+## 5 -0.002880387 0.046260586 -0.010191113
+## 6 -0.016022363 -0.001986061 -0.006597184
+## 7 -0.008931584 -0.065739497 -0.631677973
+## 8 -0.009480686 -0.063667800 -0.019169602
+## oldratiodiff.year1891 oldratiodiff.TypeMixed oldratiodiff.TypeRural
+## 1 -0.13605297 0.59741113 0.57639335
+## 2 -0.29860619 0.09118345 -0.02595994
+## 3 0.03091964 -0.04765164 -0.05433189
+## 4 -0.02576518 -0.01770934 -0.03306525
+## 5 -0.04821125 -0.01551577 -0.02744388
+## 6 0.02529348 -0.03434564 -0.02920560
+## 7 0.11043376 -0.23031013 -0.27904505
+## 8 -0.07750413 -0.04061349 -0.08217588
+## oldratiodiff.TypeUrban outratiodiff.y
+## 1 0.56508606 0.2353312
+## 2 -0.03667247 0.2400089
+## 3 -0.05514832 0.2414890
+## 4 -0.01614687 0.2354546
+## 5 -0.01559710 0.2357447
+## 6 -0.02590141 0.2364103
+## 7 0.47280266 0.2371184
+## 8 -0.03744212 0.2389753
+When we compare the dfbeta statistics to the jackknife we see that the exact same rows appear from both. They correspond to each other.
+2.@AronowSamii2015a note that the influence of observations in a regression coefficient is different than the the influence of regression observations in the entire regression. Calculate the observation weights for outratiodiff
.
outratiodiff
on the control variables.M2_out <- lm (outratiodiff ~ (popratiodiff + oldratiodiff) * (year + Type), data = pauperism)
+
+M2_out
+##
+## Call:
+## lm(formula = outratiodiff ~ (popratiodiff + oldratiodiff) * (year +
+## Type), data = pauperism)
+##
+## Coefficients:
+## (Intercept) popratiodiff oldratiodiff
+## -204.4514 0.3320 1.9749
+## year1891 TypeMixed TypeRural
+## 91.6318 256.6422 259.3359
+## TypeUrban popratiodiff:year1891 popratiodiff:TypeMixed
+## 247.5671 -0.1445 -0.1605
+## popratiodiff:TypeRural popratiodiff:TypeUrban oldratiodiff:year1891
+## -0.3584 -0.2358 -0.3333
+## oldratiodiff:TypeMixed oldratiodiff:TypeRural oldratiodiff:TypeUrban
+## -1.9846 -1.7723 -1.9101
+summary(resid (M2_out))
+## Min. 1st Qu. Median Mean 3rd Qu. Max.
+## -103.400 -21.690 -4.504 0.000 16.320 452.300
+M2_resid <- tidy (resid(M2_out)) %>%
+ mutate (x = abs(x)) %>%
+ top_n (x, n = 25)
+
+head(M2_resid) # printing only 6 rows but there are 25 values.
+## # A tibble: 6 × 2
+## names x
+## <chr> <dbl>
+## 1 81 124.83695
+## 2 93 93.65999
+## 3 242 177.80260
+## 4 321 452.27249
+## 5 372 97.06956
+## 6 390 82.75111
+With the resid function we can see the top 25 rows with the absolute highest values from the M2_out regression.
+M2_resid %>%
+ rename (.rownames = names) %>%
+ inner_join (jack, by = ".rownames")
+## # A tibble: 7 × 3
+## .rownames x outratiodiff
+## <chr> <dbl> <dbl>
+## 1 93 93.65999 0.2353312
+## 2 242 177.80260 0.2400089
+## 3 321 452.27249 0.2414890
+## 4 456 151.54252 0.2354546
+## 5 1047 127.49436 0.2357447
+## 6 1134 146.44627 0.2364103
+## 7 1620 367.86213 0.2389753
+An informal way to assess the potential impact of omitted variables on the coeficient of the variable of interest is to coefficient variation when covariates are added as a measure of the potential for omitted variable bias [@Oster2016a].
+@NunnWantchekon2011a (Table 4) calculate a simple statistic for omitted variable bias in OLS. This statistic “provide[s] a measure to gauge the strength of the likely bias arising from unobservables: how much stronger selection on unobservables, relative to selection on observables, must be to explain away the full estimated effect.”
+beta_hat_R <- tidy (lm(paupratiodiff ~ outratiodiff, data = pauperism)) %>%
+ filter (term == "outratiodiff") %>%
+ select (estimate)
+
+beta_hat_R <- as.numeric(beta_hat_R)
+beta_hat_R
+## [1] 0.3062392
+beta_hat_F <- tidy_M2 %>%
+ filter (term == "outratiodiff") %>%
+ select (estimate)
+
+beta_hat_F <- as.numeric (beta_hat_F)
+beta_hat_F
+## [1] 0.2325769
+beta_hat = beta_hat_F / (beta_hat_R - beta_hat_F)
+beta_hat
+## [1] 3.157342
+Calculate this statistic for M2 and interpret it.
+The ratio is 3.157. According to the Nunn and Wantchekon 2011 article, the selection on unobservables must be more than 3 times stronger than the selection on observables to explain away the entire effect (Pg. 3238).
+outratio
coefficients? Use the sandwich package to add HAC standard errors [@Zeileis2004a].Compare SE to the HAC.HAC_M2 <- vcovHAC(M2) #Gives the estimated HAC covariance matrix
+diag_M2 <- sqrt(diag(vcovHAC(M2))) #Gives the HAC standard errors
+
+diag_M2[2] - tidy_M2$std.error[2]
+## outratiodiff
+## 0.005028073
+HAC_M3 <- vcovHAC(M3)
+diag_M3 <- sqrt(diag(vcovHAC(M3)))
+tidy_M3 <- tidy (M3)
+
+diag_M3[2] - tidy_M3$std.error[2]
+## outratiodiff
+## 0.0132523
+Using the HAC calculation we can see an increase in the standard error from the original M2 regression by around 0.005.
+Using the HAC calculation we can see an increase in the standard error from the original M3 regression by around 0.0132.
+tidy(lm(pauper2 ~ (outratio + Popn2 + Prop65) * year * Type - 1, data = pauperism))
+## term estimate std.error statistic
+## 1 outratio 5.708958e-04 8.005294e-04 0.7131479210
+## 2 Popn2 -5.025627e-08 4.494016e-08 -1.1182931580
+## 3 Prop65 -1.801554e-01 5.149112e-01 -0.3498766749
+## 4 year1871 5.955222e-02 2.020238e-02 2.9477822141
+## 5 year1881 -9.485841e-06 1.912446e-02 -0.0004960056
+## 6 year1891 3.542339e-02 1.771497e-02 1.9996298499
+## 7 TypeMixed -4.853681e-02 2.097216e-02 -2.3143446585
+## 8 TypeRural -6.334421e-02 2.153513e-02 -2.9414362704
+## 9 TypeUrban -6.401214e-02 2.180260e-02 -2.9359859167
+## 10 outratio:year1881 5.516347e-03 3.450922e-03 1.5985140373
+## 11 outratio:year1891 2.855633e-03 2.984223e-03 0.9569101161
+## 12 Popn2:year1881 -2.762783e-08 5.891525e-08 -0.4689418597
+## 13 Popn2:year1891 -2.761539e-08 5.465754e-08 -0.5052438011
+## 14 Prop65:year1881 9.770247e-01 7.138906e-01 1.3685917152
+## 15 Prop65:year1891 1.360550e-01 6.667050e-01 0.2040707241
+## 16 outratio:TypeMixed 2.857434e-04 8.336068e-04 0.3427796522
+## 17 outratio:TypeRural 4.607440e-04 8.351230e-04 0.5517079504
+## 18 outratio:TypeUrban 9.951729e-04 8.819306e-04 1.1284027694
+## 19 Popn2:TypeMixed -3.793316e-07 8.310262e-08 -4.5646165868
+## 20 Popn2:TypeRural 6.734344e-08 8.570810e-08 0.7857302403
+## 21 Popn2:TypeUrban 9.575764e-08 5.425215e-08 1.7650477300
+## 22 Prop65:TypeMixed 9.930927e-01 5.211824e-01 1.9054609278
+## 23 Prop65:TypeRural 9.762405e-01 5.246110e-01 1.8608845521
+## 24 Prop65:TypeUrban 9.254490e-01 5.330645e-01 1.7360920998
+## 25 year1881:TypeMixed 4.835546e-02 2.893869e-02 1.6709623975
+## 26 year1891:TypeMixed 1.629334e-02 2.806250e-02 0.5806090462
+## 27 year1881:TypeRural 4.352313e-02 2.973288e-02 1.4638047731
+## 28 year1891:TypeRural 1.142985e-02 2.877682e-02 0.3971893005
+## 29 year1881:TypeUrban 7.359464e-02 3.000889e-02 2.4524276041
+## 30 year1891:TypeUrban 3.345950e-02 2.909343e-02 1.1500709107
+## 31 outratio:year1881:TypeMixed -4.749300e-03 3.477899e-03 -1.3655657294
+## 32 outratio:year1891:TypeMixed -2.577668e-03 3.013496e-03 -0.8553745772
+## 33 outratio:year1881:TypeRural -4.820566e-03 3.473224e-03 -1.3879224976
+## 34 outratio:year1891:TypeRural -2.546709e-03 3.007052e-03 -0.8469122618
+## 35 outratio:year1881:TypeUrban -5.802898e-03 3.529730e-03 -1.6440062308
+## 36 outratio:year1891:TypeUrban -3.560759e-03 3.066255e-03 -1.1612726322
+## 37 Popn2:year1881:TypeMixed 3.546083e-07 1.094249e-07 3.2406543554
+## 38 Popn2:year1891:TypeMixed 3.728497e-07 1.040752e-07 3.5825034936
+## 39 Popn2:year1881:TypeRural 9.482280e-08 1.142364e-07 0.8300578965
+## 40 Popn2:year1891:TypeRural 7.434405e-08 1.079794e-07 0.6885022733
+## 41 Popn2:year1881:TypeUrban -2.211217e-08 7.076765e-08 -0.3124615712
+## 42 Popn2:year1891:TypeUrban -3.374132e-08 6.613930e-08 -0.5101553026
+## 43 Prop65:year1881:TypeMixed -1.292371e+00 7.226732e-01 -1.7883203392
+## 44 Prop65:year1891:TypeMixed -5.667597e-01 6.757782e-01 -0.8386770918
+## 45 Prop65:year1881:TypeRural -1.112461e+00 7.276014e-01 -1.5289430356
+## 46 Prop65:year1891:TypeRural -3.870140e-01 6.800808e-01 -0.5690705995
+## 47 Prop65:year1881:TypeUrban -1.428906e+00 7.389726e-01 -1.9336386438
+## 48 Prop65:year1891:TypeUrban -5.094620e-01 6.922665e-01 -0.7359333429
+## p.value
+## 1 4.758507e-01
+## 2 2.635975e-01
+## 3 7.264740e-01
+## 4 3.243628e-03
+## 5 9.996043e-01
+## 6 4.569683e-02
+## 7 2.076578e-02
+## 8 3.310491e-03
+## 9 3.368918e-03
+## 10 1.101118e-01
+## 11 3.387467e-01
+## 12 6.391705e-01
+## 13 6.134522e-01
+## 14 1.713051e-01
+## 15 8.383223e-01
+## 16 7.318060e-01
+## 17 5.812199e-01
+## 18 2.593068e-01
+## 19 5.357876e-06
+## 20 4.321335e-01
+## 21 7.773266e-02
+## 22 5.688629e-02
+## 23 6.293051e-02
+## 24 8.272610e-02
+## 25 9.491050e-02
+## 26 5.615798e-01
+## 27 1.434295e-01
+## 28 6.912770e-01
+## 29 1.428804e-02
+## 30 2.502741e-01
+## 31 1.722533e-01
+## 32 3.924626e-01
+## 33 1.653399e-01
+## 34 3.971615e-01
+## 35 1.003571e-01
+## 36 2.456917e-01
+## 37 1.215279e-03
+## 38 3.497614e-04
+## 39 4.066209e-01
+## 40 4.912291e-01
+## 41 7.547275e-01
+## 42 6.100079e-01
+## 43 7.389979e-02
+## 44 4.017667e-01
+## 45 1.264619e-01
+## 46 5.693823e-01
+## 47 5.332100e-02
+## 48 4.618713e-01
+lm(pauper2 ~ outratio + Popn2 + Prop65, data = pauperism)
+##
+## Call:
+## lm(formula = pauper2 ~ outratio + Popn2 + Prop65, data = pauperism)
+##
+## Coefficients:
+## (Intercept) outratio Popn2 Prop65
+## 8.398e-03 1.956e-03 -1.475e-08 3.284e-01
+all_interact <-
+ crossing(Type = pauperism$Type, year = c(1881, 1891)) %>%
+ mutate(mod = map2(year, Type,
+ function(yr, ty) {
+ lm(paupratiodiff ~ outratiodiff + popratiodiff + oldratiodiff,
+ data = filter(pauperism,
+ year == yr,
+ Type == ty))
+ })) %>%
+ mutate(mod_glance = map(mod, broom::glance),
+ mod_tidy = map(mod, broom::tidy))
+
+all_interact %>%
+ mutate (sigma = map_dbl(mod_glance, function (x) x$sigma)) %>%
+ mutate (std.error.out = map_dbl(mod_tidy, function (x) x$std.error[2])) %>%
+ mutate (estimate.out = map_dbl(mod_tidy, function (x) x$estimate[2])) %>%
+ select (year, Type, sigma, std.error.out, estimate.out)
+## # A tibble: 8 × 5
+## year Type sigma std.error.out estimate.out
+## <dbl> <chr> <dbl> <dbl> <dbl>
+## 1 1881 Metropolitan 9.886436 0.13716458 0.7085593
+## 2 1891 Metropolitan 24.790240 0.11858387 0.3263541
+## 3 1881 Mixed 16.437527 0.04011237 0.1996762
+## 4 1891 Mixed 17.403411 0.02576831 0.1656616
+## 5 1881 Rural 13.801753 0.03316465 0.2308180
+## 6 1891 Rural 17.078948 0.02282518 0.2137457
+## 7 1881 Urban 19.523919 0.07970376 0.7107494
+## 8 1891 Urban 25.557318 0.07559362 0.2632868
+Popn
) and interpret the coefficients on outratiodiff
and interactions. Informally assess the extent to which the coefficients are different. Which one does it seem to affect more?M2_weights <- lm(paupratiodiff ~ outratiodiff + (popratiodiff + oldratiodiff) *
+ (year + Type), weights = Popn, data = pauperism)
+
+head(tidy(M2_weights))
+## term estimate std.error statistic p.value
+## 1 (Intercept) -27.6860026 24.91557702 -1.111193 2.667150e-01
+## 2 outratiodiff 0.3644746 0.01907747 19.104979 5.472523e-71
+## 3 popratiodiff -0.1960724 0.06021243 -3.256344 1.161100e-03
+## 4 oldratiodiff 1.0143631 0.20524209 4.942276 8.851788e-07
+## 5 year1891 -23.1753237 17.39122610 -1.332587 1.829281e-01
+## 6 TypeMixed 43.7123212 30.20646454 1.447118 1.481331e-01
+M3_weights <- lm(-1 + paupratiodiff ~ (outratiodiff + popratiodiff + oldratiodiff) *
+ (year + Type), weights = Popn, data = pauperism)
+
+head(tidy(M3_weights))
+## term estimate std.error statistic p.value
+## 1 (Intercept) 23.7603236 25.80279872 0.9208429 3.573239e-01
+## 2 outratiodiff 0.7167659 0.04453160 16.0956706 9.131931e-53
+## 3 popratiodiff -0.2379611 0.05898587 -4.0342058 5.837204e-05
+## 4 oldratiodiff 0.3779429 0.22090876 1.7108552 8.737520e-02
+## 5 year1891 2.3389679 17.13905009 0.1364701 8.914733e-01
+## 6 TypeMixed -27.6445457 31.11298198 -0.8885213 3.744446e-01
+Holding all else constant and with weights, on average in M2 a one unit change in outratiodiff has a 0.36 increase in paupratiodiff. Holding all else constant and with weights, on average in M3 a one unit change in outratiodiff has a 0.72 increase in paupratiodiff. The difference between the two is quite stark. Both M2 and M3 increase when compared to their original coefficients, without weights. However, between the two M3 is effected more (from 0.53 to 0.72).
+According to “Mastering Metrics” weighting weights each term in the residual sum of squares by population or some other weight. We are able to take into account the population of people in each district and make sure the larger districts count for more. In this way we are able to account for small districts having more variability on average.
+When using regression for causal inference, model specification and choice should largely be based on avoiding omitted variables. Another criteria for selecting models is to use their fit to the data. But a model’s fit to data should not be assessed using only the in-sample data. That leads to overfitting—and the best model would always be to include an indicator variable for every observation Instead, a model’s fit to data can be assessed by using its out-of-sample fit. One way to estimate the expected fit of a model to new data is cross-validation.
+We want to compare the predictive performance of the following models
+mod_formulas <-
+ list(
+ m0 = paupratiodiff ~ 1,
+ m1 = paupratiodiff ~ year + Type,
+ m2 = paupratiodiff ~ outratiodiff + year + Type,
+ m3 = paupratiodiff ~ outratiodiff + (popratiodiff + oldratiodiff) * (year + Type),
+ m4 = -1 + paupratiodiff ~ (outratiodiff + popratiodiff + oldratiodiff) * (year + Type),
+ m5 = paupratiodiff ~ (outratiodiff + popratiodiff + oldratiodiff) * year * Type
+ )
+Let’s split the data into 10 (train/test) folds for cross-validation,
+pauperism_nonmiss <-
+ pauperism %>%
+ filter(year %in% c(1881, 1891)) %>%
+ select(paupratiodiff, outratiodiff, popratiodiff, oldratiodiff, year, Type, Region, ID, BoothGroup) %>%
+ tidyr::drop_na()
+pauperism_10folds <-
+ pauperism_nonmiss %>%
+ resamplr::crossv_kfold(10)
+For each model formula f
, training data set train
, and test data set, test
, run the model specified by f
on train
, and predict new observations in test
, and calculate the RMSE from the residuals
mod_rmse_fold <- function(f, train, test) {
+ fit <- lm(f, data = as.data.frame(train))
+ test_data <- as.data.frame(test)
+ err <- test_data$paupratiodiff - predict(fit, newdata = test_data)
+ sqrt(mean(err ^ 2))
+}
+E.g. for one fold and formula,
+mod_rmse_fold(mod_formulas[[1]], pauperism_10folds$train[[1]],
+ pauperism_10folds$test[[1]])
+## [1] 24.16498
+Now write a function that will calculate the average RMSE across folds for a formula and a cross-validation data frame with train
and test
list-columns:
mod_rmse <- function(f, data) {
+ map2_dbl(data$train, data$test,
+ function(train, test) {
+ mod_rmse_fold(f, train, test)
+ }) %>%
+ mean()
+}
+mod_rmse(mod_formulas[[1]], pauperism_10folds)
+## [1] 24.23181
+Finally, we want to run mod_rmse
for each formula in mod_formulas
. It will be easiest to store this in a data frame:
cv_results <- tibble(
+ model_formula = mod_formulas,
+ .id = names(mod_formulas),
+ # Formula as a string
+ .name = map(model_formula,
+ function(x) gsub(" +", " ", paste0(deparse(x), collapse = "")))
+)
+Use map
to run mod_rmse
for each model and save it as a list frame in the data frame,
cv_results <-
+ mutate(cv_results,
+ cv10_rmse = map(model_formula, mod_rmse, data = pauperism_10folds))
+In the case of linear regression, the MSE of the Leave-one-out (\(n\)-fold) cross-validation can be analytically calculated without having to run \(n\) regressions.
+loocv <- function(x) {
+ mean((residuals(x) / (1 - hatvalues(x))) ^ 2)
+}
+cv_results <-
+ mutate(cv_results,
+ rmse_loo = map(mod_formulas, function(f) sqrt(loocv(lm(f, data = pauperism_nonmiss)))))
+for (i in seq_len(nrow(cv_results))) {
+ print (cv_results$cv10_rmse[i])
+}
+## $m0
+## [1] 24.23181
+##
+## $m1
+## [1] 20.98203
+##
+## $m2
+## [1] 19.05795
+##
+## $m3
+## [1] 18.38304
+##
+## $m4
+## [1] 18.20314
+##
+## $m5
+## [1] 18.25599
+From this analysis we can see that Model 4 has the best out of sample prediction.
+for (i in seq_len(nrow(cv_results))) {
+ print (cv_results$rmse_loo[i])
+}
+## $m0
+## [1] 24.34132
+##
+## $m1
+## [1] 21.11967
+##
+## $m2
+## [1] 19.2609
+##
+## $m3
+## [1] 18.56803
+##
+## $m4
+## [1] 18.43676
+##
+## $m5
+## [1] 18.4601
+Once again, Model 4 has the best out of sample prediction.
+The task makes sense. The individual PLUs on average should have similar standard errors. However, as seen in the weighted regression section, smaller PLUs can have more variation, so we should make sure we account for this variation in the PLU size and effect.
+Estimate the 95% confidence intervals of model with simple non-parametric bootstrapped standard errors. The non-parametric bootstrap works as follows:
+Let \(\hat\theta\) be the estimate of a statistic. To calculate bootstrapped standard errors and confidence intervals use the following procedure.
+For samples \(b = 1, ..., B\).
+Let \(\theta^* = \{\theta_1^*, \dots, \theta_B^*\}\) be the set of bootstrapped statistics.
+confidence interval:
+Original model
+mod_formula <- paupratiodiff ~ outratiodiff + (popratiodiff + oldratiodiff) * year * Type
+mod_orig <- lm(mod_formula, data = pauperism_nonmiss)
+bs_coef_se <-
+ resamplr::bootstrap(pauperism_nonmiss, 1024) %>%
+ # extract the strap column
+ `[[`("sample") %>%
+ # run
+ map_df(function(dat) {
+ lm(mod_formula, data = dat) %>%
+ broom::tidy() %>%
+ select(term, estimate)
+ }) %>%
+ # calculate 2.5%, 97.5% and sd of estimates
+ group_by(term) %>%
+ summarise(
+ std.error_bs = sd(estimate),
+ conf.low_bsq = quantile(estimate, 0.025),
+ conf.low_bsq = quantile(estimate, 0.975)
+ )
+Now compare the std.error of the original and the bootstrap for outratiodiff
broom::tidy(mod_orig, conf.int = TRUE) %>%
+ select(term, estimate, std.error) %>%
+ filter(term == "outratiodiff") %>%
+ left_join(bs_coef_se, by = "term")
+## term estimate std.error std.error_bs conf.low_bsq
+## 1 outratiodiff 0.2273004 0.01434334 0.01881032 0.2668222
+The bootstrap standard error is slightly higher. It is similar to the standard error generated using the heteroskedasticity consistent standard error.
+sqrt(sandwich::vcovHC(mod_orig)["outratiodiff", "outratiodiff"])
+## [1] 0.01986397
+It is likely that there is correlation between the error terms of observations. At the very least, each PLU is included twice; these observations are likely correlated, so we are effectively overstating the sample size of our data. One way to account for that is to resample “PLUs”, not PLU-years. This cluster-bootstrap will resample each PLU (and all its observations), rather than resampling the observations themselves.
+pauperism_nonmiss %>%
+ group_by(ID) %>%
+ resamplr::bootstrap(1024) %>%
+ # extract the strap column
+ `[[`("sample") %>%
+ # run
+ map_df(function(dat) {
+ lm(mod_formula, data = dat) %>%
+ broom::tidy() %>%
+ select(term, estimate)
+ }) %>%
+ # calculate 2.5%, 97.5% and sd of estimates
+ group_by(term) %>%
+ summarise(
+ std.error_bs = sd(estimate),
+ conf.low_bsq = quantile(estimate, 0.025),
+ conf.low_bsq = quantile(estimate, 0.975)
+ ) %>%
+ filter(term == "outratiodiff")
+## # A tibble: 1 × 3
+## term std.error_bs conf.low_bsq
+## <chr> <dbl> <dbl>
+## 1 outratiodiff 0.02006236 0.2733225
+However, this yields a standard error not much different than the Robust standard error.
+# Region:
+
+pauperism_nonmiss %>%
+ group_by(Region) %>%
+ resamplr::bootstrap(1024) %>%
+ # extract the strap column
+ `[[`("sample") %>%
+ # run
+ map_df(function(dat) {
+ lm(mod_formula, data = dat) %>%
+ broom::tidy() %>%
+ select(term, estimate)
+ }) %>%
+ # calculate 2.5%, 97.5% and sd of estimates
+ group_by(term) %>%
+ summarise(
+ std.error_bs = sd(estimate),
+ conf.low_bsq = quantile(estimate, 0.025),
+ conf.low_bsq = quantile(estimate, 0.975)
+ ) %>%
+ filter(term == "outratiodiff")
+## # A tibble: 1 × 3
+## term std.error_bs conf.low_bsq
+## <chr> <dbl> <dbl>
+## 1 outratiodiff 0.01882124 0.2751381
+# BoothGroup
+
+pauperism_nonmiss %>%
+ group_by(BoothGroup) %>%
+ resamplr::bootstrap(1024) %>%
+ # extract the strap column
+ `[[`("sample") %>%
+ # run
+ map_df(function(dat) {
+ lm(mod_formula, data = dat) %>%
+ broom::tidy() %>%
+ select(term, estimate)
+ }) %>%
+ # calculate 2.5%, 97.5% and sd of estimates
+ group_by(term) %>%
+ summarise(
+ std.error_bs = sd(estimate),
+ conf.low_bsq = quantile(estimate, 0.025),
+ conf.low_bsq = quantile(estimate, 0.975)
+ ) %>%
+ filter(term == "outratiodiff")
+## # A tibble: 1 × 3
+## term std.error_bs conf.low_bsq
+## <chr> <dbl> <dbl>
+## 1 outratiodiff 0.02148936 0.277693
+When we run these two calculations we can see that neither of these makes a significantly large difference in the standard errors for Region or BoothGroup, compared to the original bootstrap.
+