
Commit

Merge 7a38ab5 into b320b35
trevorcampbell authored Nov 10, 2023
2 parents b320b35 + 7a38ab5 commit f9a06d2
Showing 3 changed files with 17 additions and 16 deletions.
15 changes: 8 additions & 7 deletions source/inference.Rmd
@@ -174,7 +174,7 @@ population_proportion <- airbnb |>
```

We can see that the proportion of `Entire home/apt` listings in
-the data set is `r round(population_proportion,3)`. This
+the data set is `r round(population_proportion,3)`. This
value, `r round(population_proportion,3)`, is the population parameter. Remember, this
parameter value is usually unknown in real data analysis problems, as it is
typically not possible to make measurements for an entire population.
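
As a minimal sketch of the idea above (using a small hypothetical `listings` tibble rather than the book's `airbnb` data), a population proportion is just the mean of a logical condition evaluated over every row:

```r
# Hypothetical data standing in for the full data set of listings
library(dplyr)

listings <- tibble(room_type = c("Entire home/apt", "Private room",
                                 "Entire home/apt", "Shared room"))

# Proportion of rows where the condition holds: the mean of a logical vector
listings |>
  summarize(population_proportion = mean(room_type == "Entire home/apt"))
```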
@@ -398,7 +398,7 @@ estimates
```

The average value of the sample of size 40
-is \$`r round(estimates$mean_price, 2)`. This
+is \$`r format(round(estimates$mean_price, 2), nsmall=2)`. This
number is a point estimate for the mean of the full population.
Recall that the population mean was
\$`r round(population_parameters$mean_price,2)`. So our estimate was fairly close to
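
The sampling-and-estimation step described above can be sketched in base R, with a simulated stand-in for the population of prices (the real data are not loaded here):

```r
# Simulated stand-in for the population of listing prices (not the real data)
set.seed(123)
population_prices <- rgamma(4594, shape = 2, scale = 75)

# Draw a single random sample of size 40 and compute the point estimate
one_sample_prices <- sample(population_prices, size = 40)
mean(one_sample_prices)  # point estimate of the population mean price
```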
@@ -771,7 +771,7 @@ and use a bootstrap distribution using just a single sample from the population.
Once again, suppose we are
interested in estimating the population mean price per night of all Airbnb
listings in Vancouver, Canada, using a single sample size of 40.
-Recall our point estimate was \$`r round(estimates$mean_price, 2)`. The
+Recall our point estimate was \$`r format(round(estimates$mean_price, 2), nsmall=2)`. The
histogram of prices in the sample is displayed in Figure \@ref(fig:11-bootstrapping1).

```{r, echo = F, message = F, warning = F}
@@ -791,7 +791,7 @@ one_sample_dist
```

The histogram for the sample is skewed, with a few observations out to the right. The
-mean of the sample is \$`r round(estimates$mean_price, 2)`.
+mean of the sample is \$`r format(round(estimates$mean_price, 2), nsmall=2)`.
Remember, in practice, we usually only have this one sample from the population. So
this sample and estimate are the only data we can work with.
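
Under the assumption that this one sample is all we have, the bootstrap amounts to resampling it with replacement. A base R sketch, with simulated prices standing in for the real sample of 40:

```r
# Simulated stand-in for the single observed sample of 40 prices
set.seed(123)
one_sample_prices <- rgamma(40, shape = 2, scale = 75)

# Resample with replacement many times; each resample's mean is one
# bootstrap estimate, and together they form the bootstrap distribution
boot_means <- replicate(1000,
                        mean(sample(one_sample_prices, size = 40,
                                    replace = TRUE)))
```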

@@ -1114,7 +1114,8 @@ To calculate a 95\% percentile bootstrap confidence interval, we will do the fol

\newpage

-To do this in R, we can use the `quantile()` function:
+To do this in R, we can use the `quantile()` function. Quantiles are expressed in proportions rather than
+percentages, so the 2.5th and 97.5th percentiles would be the 0.025 and 0.975 quantiles, respectively.
\index{quantile}
\index{pull}
\index{select}
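
For instance, on a vector of bootstrap means (simulated here for illustration), the 0.025 and 0.975 quantiles give the endpoints of a 95% percentile interval:

```r
# Simulated bootstrap means, standing in for the real bootstrap distribution
set.seed(123)
boot_means <- rnorm(1000, mean = 150, sd = 10)

# 2.5th and 97.5th percentiles = the 0.025 and 0.975 quantiles
quantile(boot_means, probs = c(0.025, 0.975))
```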
@@ -1149,9 +1150,9 @@ boot_est_dist +
To finish our estimation of the population parameter, we would report the point
estimate and our confidence interval's lower and upper bounds. Here the sample
mean price per night of 40 Airbnb listings was
-\$`r round(mean(one_sample$price),2)`, and we are 95\% "confident" that the true
+\$`r format(round(mean(one_sample$price),2), nsmall=2)`, and we are 95\% "confident" that the true
population mean price per night for all Airbnb listings in Vancouver is between
-\$(`r round(bounds[1],2)`, `r round(bounds[2],2)`).
+\$`r round(bounds[1],2)` and \$`r round(bounds[2],2)`.
Notice that our interval does indeed contain the true
population mean value, \$`r round(mean(airbnb$price),2)`\! However, in
practice, we would not know whether our interval captured the population
10 changes: 5 additions & 5 deletions source/regression1.Rmd
@@ -456,8 +456,8 @@ the model and returns the RMSPE for each number of neighbors. In the output of t
results data frame, we see that the `neighbors` variable contains the value of $K$,
the mean (`mean`) contains the value of the RMSPE estimated via cross-validation,
and the standard error (`std_err`) contains a value corresponding to a measure of how uncertain we are in the mean value. A detailed treatment of this
-is beyond the scope of this chapter; but roughly, if your estimated mean is 100,000 and standard
-error is 1,000, you can expect the *true* RMSPE to be somewhere roughly between 99,000 and 101,000 (although it may
+is beyond the scope of this chapter; but roughly, if your estimated mean RMSPE is \$100,000 and standard
+error is \$1,000, you can expect the *true* RMSPE to be somewhere roughly between \$99,000 and \$101,000 (although it may
fall outside this range). You may ignore the other columns in the metrics data frame,
as they do not provide any additional insight.
\index{cross-validation!collect\_metrics}
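
The rough interpretation in the paragraph above is just the mean plus or minus one standard error; with the numbers used there:

```r
# Rough uncertainty range for an estimated RMSPE: mean +/- one standard error
mean_rmspe <- 100000
std_err    <- 1000
c(lower = mean_rmspe - std_err, upper = mean_rmspe + std_err)
#>  lower  upper
#>  99000 101000
```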
@@ -763,9 +763,9 @@ predictor *as part of the model tuning process* (e.g., if we are running forward
in the chapter on evaluating and tuning classification models),
then we must compare the accuracy estimated using only the training data via cross-validation.
Looking back, the estimated cross-validation accuracy for the single-predictor
-model was `r format(round(sacr_min$mean), big.mark=",", nsmall=0, scientific = FALSE)`.
+model was \$`r format(round(sacr_min$mean), big.mark=",", nsmall=0, scientific = FALSE)`.
The estimated cross-validation accuracy for the multivariable model is
-`r format(round(sacr_multi$mean), big.mark=",", nsmall=0, scientific = FALSE)`.
+\$`r format(round(sacr_multi$mean), big.mark=",", nsmall=0, scientific = FALSE)`.
Thus in this case, we did not improve the model
by a large amount by adding this additional predictor.

@@ -797,7 +797,7 @@ knn_mult_mets

This time, when we performed KNN regression on the same data set, but also
included the number of bedrooms as a predictor, we obtained an RMSPE test error
-of `r format(round(knn_mult_mets |> pull(.estimate)), big.mark=",", nsmall=0, scientific=FALSE)`.
+of \$`r format(round(knn_mult_mets |> pull(.estimate)), big.mark=",", nsmall=0, scientific=FALSE)`.
Figure \@ref(fig:07-knn-mult-viz) visualizes the model's predictions overlaid on top of the data. This
time the predictions are a surface in 3D space, instead of a line in 2D space, as we have 2
predictors instead of 1.
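
RMSPE itself is straightforward to compute by hand: it is the square root of the mean squared difference between predicted and actual values on test data, which is why it carries the units of the response. A sketch with hypothetical sale prices:

```r
# Hypothetical predicted and actual sale prices on a small test set (USD)
actual    <- c(350000, 420000, 510000)
predicted <- c(365000, 400000, 540000)

# RMSPE: root mean squared prediction error, in the same units as the response
sqrt(mean((actual - predicted)^2))
```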
8 changes: 4 additions & 4 deletions source/regression2.Rmd
@@ -284,7 +284,7 @@ lm_test_results
```

Our final model's test error as assessed by RMSPE \index{RMSPE}
-is `r format(round(lm_test_results |> filter(.metric == 'rmse') |> pull(.estimate)), big.mark=",", nsmall=0, scientific=FALSE)`.
+is \$`r format(round(lm_test_results |> filter(.metric == 'rmse') |> pull(.estimate)), big.mark=",", nsmall=0, scientific=FALSE)`.
Remember that this is in units of the response variable, and here that
is US Dollars (USD). Does this mean our model is "good" at predicting house
sale price based off of the predictor of home size? Again, answering this is
@@ -504,7 +504,7 @@ lm_mult_test_results
```

Our model's test error as assessed by RMSPE
-is `r format(round(lm_mult_test_results |> filter(.metric == 'rmse') |> pull(.estimate)), big.mark=",", nsmall=0, scientific=FALSE)`.
+is \$`r format(round(lm_mult_test_results |> filter(.metric == 'rmse') |> pull(.estimate)), big.mark=",", nsmall=0, scientific=FALSE)`.
In the case of two predictors, we can see that the predictions made by our linear regression form a *plane* of best fit, as
shown in Figure \@ref(fig:08-3DlinReg).

@@ -614,12 +614,12 @@ lm_mult_test_results
```

We obtain an RMSPE \index{RMSPE} for the multivariable linear regression model
-of `r format(lm_mult_test_results |> filter(.metric == 'rmse') |> pull(.estimate), big.mark=",", nsmall=0, scientific = FALSE)`. This prediction error
+of \$`r format(lm_mult_test_results |> filter(.metric == 'rmse') |> pull(.estimate), big.mark=",", nsmall=0, scientific = FALSE)`. This prediction error
is less than the prediction error for the multivariable KNN regression model,
indicating that we should likely choose linear regression for predictions of
house sale price on this data set. Revisiting the simple linear regression model
with only a single predictor from earlier in this chapter, we see that the RMSPE for that model was
-`r format(lm_test_results |> filter(.metric == 'rmse') |> pull(.estimate), big.mark=",", nsmall=0, scientific = FALSE)`,
+\$`r format(lm_test_results |> filter(.metric == 'rmse') |> pull(.estimate), big.mark=",", nsmall=0, scientific = FALSE)`,
which is slightly higher than that of our more complex model. Our model with two predictors
provided a slightly better fit on test data than our model with just one.
As mentioned earlier, this is not always the case: sometimes including more
