Commit fd16b45

Merge pull request #222 from tidymodels/misc-updates-2022
Misc updates
2 parents c1ab3f9 + 1d456a8 commit fd16b45

7 files changed: 8 additions and 28 deletions

11-comparing-models.Rmd

Lines changed: 0 additions & 2 deletions
@@ -257,13 +257,11 @@ where the residuals $\epsilon_{ij}$ are assumed to be independent and follow a G
 A Bayesian linear model makes additional assumptions. In addition to specifying a distribution for the residuals, we require _prior distribution_ specifications for the model parameters ( $\beta_j$ and $\sigma$ ). These are distributions for the parameters that the model assumes before being exposed to the observed data. For example, a simple set of prior distributions for our model might be:


-$$
 \begin{align}
 \epsilon_{ij} &\sim N(0, \sigma) \notag \\
 \beta_j &\sim N(0, 10) \notag \\
 \sigma &\sim \text{exponential}(1) \notag
 \end{align}
-$$

 These priors set the possible/probable ranges of the model parameters and have no unknown parameters. For example, the prior on $\sigma$ indicates that values must be larger than zero, are very right-skewed, and have values that are usually less than 3 or 4.

12-tuning-parameters.Rmd

Lines changed: 1 addition & 1 deletion
@@ -121,7 +121,7 @@ In the context of generalized linear models, the logit function is the _link fun
 $$\Phi^{-1}(\pi) = \beta_0 + \beta_1x_1 + \ldots + \beta_px_p$$
 were $\Phi$ is the cumulative standard normal function, as well as the _complementary log-log_ model:

-$$\log(\log(1\pi)) = \beta_0 + \beta_1x_1 + \ldots + \beta_px_p$$
+$$\log(-\log(1-\pi)) = \beta_0 + \beta_1x_1 + \ldots + \beta_px_p$$
 Each of these models result in linear class boundaries. Which one should be we use? Since, for these data, the number of model parameters does not vary, the statistical approach is to compute the (log) likelihood for each model and determine the model with the largest value. Traditionally, the likelihood is computed using the same data that were used to estimate the parameters, not using approaches like data splitting or resampling from Chapters \@ref(splitting) and \@ref(resampling).

 For a data frame `training_set`, let's create a function to compute the different models and extract the likelihood statistics for the training set (using `broom::glance()`):

18-explaining-models-and-predictions.Rmd

Lines changed: 1 addition & 0 deletions
@@ -1,4 +1,5 @@
 ```{r explain-setup, include = FALSE}
+knitr::opts_chunk$set(fig.path = "figures/")
 library(tidymodels)
 library(forcats)
 tidymodels_prefer()

19-when-should-you-trust-predictions.Rmd

Lines changed: 1 addition & 1 deletion
@@ -1,5 +1,5 @@
-
 ```{r setup, include=FALSE}
+knitr::opts_chunk$set(fig.path = "figures/")
 library(tidymodels)
 library(applicable)
 library(patchwork)

20-ensemble-models.Rmd

Lines changed: 2 additions & 1 deletion
@@ -1,4 +1,5 @@
 ```{r ensembles-setup, include = FALSE}
+knitr::opts_chunk$set(fig.path = "figures/")
 library(tidymodels)
 library(rules)
 library(baguette)
@@ -199,7 +200,7 @@ tmp <-
 unnest(cols = "data")

 eqn <- paste(c(glmn_int$estimate, tmp$term), collapse = " \\\\\n\t+&")
-eqn <- paste0("$$\n\\begin{align}\n \\text{ensemble prediction} &=", eqn, "\n\\end{align}\n$$")
+eqn <- paste0("\n\\begin{align}\n \\text{ensemble prediction} &=", eqn, "\n\\end{align}\n")

 cat(eqn)
 ```

21-inferential-analysis.Rmd

Lines changed: 2 additions & 11 deletions
@@ -1,4 +1,5 @@
 ```{r inferential-setup, include = FALSE}
+knitr::opts_chunk$set(fig.path = "figures/")
 library(tidymodels)
 library(poissonreg)
 library(infer)
@@ -63,12 +64,11 @@ There were many more publications by men, although there were more men in the da

 For our application, the hypotheses to compare the two sexes are:

-$$
 \begin{align}
 H_0&: \lambda_m = \lambda_f \notag \\
 H_a&: \lambda_m \ne \lambda_f \notag
 \end{align}
-$$
+
 where the $\lambda$ values are the rates of publications (over the same time period).

 A basic application of the test is:
@@ -196,12 +196,10 @@ tidy(log_lin_fit, conf.int = TRUE, conf.level = 0.90)

 In this output, the p-values correspond to separate hypothesis tests for each parameter:

-$$
 \begin{align}
 H_0&: \beta_j = 0 \notag \\
 H_a&: \beta_j \ne 0 \notag
 \end{align}
-$$

 for each of the model parameters. Looking at these results, `phd` (the prestige of their department) may not have any relationship with the outcome.

@@ -238,12 +236,10 @@ glm_boot %>%

 Determining which predictors to include in the model is a difficult problem. One approach is to conduct likelihood ratio tests (LRT) [@McCullaghNelder89] between nested models. Based on the confidence intervals, we have evidence that a simpler model without `phd` may be sufficient. Let's fit a smaller model, then conduct a statistical test:

-$$
 \begin{align}
 H_0&: \beta_{phd} = 0 \notag \\
 H_a&: \beta_{phd} \ne 0 \notag
 \end{align}
-$$

 This hypothesis was previously tested when we showed the tidied results for `log_lin_fit`. That particular approach used results from a single model fit via a Wald statistic (i.e. the parameter divided by its standard error). For that approach, the p-value was `r tidy(log_lin_fit) %>% filter(term == "phd") %>% pluck("p.value") %>% format.pval()`. We can tidy the results for the LRT to get the p-value:

@@ -270,13 +266,11 @@ $$\lambda = 0 \pi + (1 - \pi) \lambda_{nz}$$

 where

-$$
 \begin{align}
 \log(\lambda_{nz}) &= \beta_0 + \beta_1x_1 + \ldots + \beta_px_p \notag \\
 & and \notag \\
 log\left(\frac{\pi}{1-\pi}\right) &= \gamma_0 + \gamma_1z_1 + \ldots + \gamma_qz_q \notag
 \end{align}
-$$

 where the $x$ covariates affect the non-zero count values and the $z$ covariates influence the probability of a zero count. The two sets of predictors do not need to be mutually exclusive.

@@ -295,13 +289,10 @@ zero_inflated_fit

 Since the coefficients for this model are also estimated using maximum likelihood, let's try to use another likelihood ratio test to understand if the new model terms are helpful. We will _simultaneously_ test that

-$$
 \begin{align}
 H_0&: \gamma_1 = 0, \gamma_2 = 0, \cdots, \gamma_5 = 0 \notag \\
 H_a&: \text{at least one} \gamma \ne 0 \notag
 \end{align}
-$$
-

 ```{r inference-zip-anova, error = TRUE}
 anova(

TMwR.bib

Lines changed: 1 addition & 12 deletions
@@ -224,18 +224,6 @@ @article{glmnet
 year={2010}
 }

-@article{pvalue,
-author = {R Wasserstein and N Lazar},
-title = {The {ASA} statement on p-values: Context, process, and purpose},
-journal = {The American Statistician},
-volume = {70},
-number = {2},
-pages = {129-133},
-year = {2016},
-publisher = {Taylor & Francis}
-}
-
-
 @article{parallel,
 author = {M Schmidberger and M Morgan and D Eddelbuettel and H Yu and L Tierney and U Mansmann},
 title = {State of the art in parallel computing with {R}},
@@ -892,6 +880,7 @@ @Book{Molnar2021
 year = {2020},
 isbn = {9780244768522},
 url = {https://christophm.github.io/interpretable-ml-book/},
+publisher = {lulu.com}
 }

 @inproceedings{Lundberg2017,
