
Commit

Merge 7a38ab5 into b320b35
trevorcampbell authored Nov 10, 2023
2 parents b320b35 + 7a38ab5 commit f9a06d2
Showing 3 changed files with 17 additions and 16 deletions.
15 changes: 8 additions & 7 deletions source/inference.Rmd
@@ -174,7 +174,7 @@ population_proportion <- airbnb |>
```

We can see that the proportion of `Entire home/apt` listings in
-the data set is `r round(population_proportion,3)`. This
+the data set is `r round(population_proportion,3)`. This
value, `r round(population_proportion,3)`, is the population parameter. Remember, this
parameter value is usually unknown in real data analysis problems, as it is
typically not possible to make measurements for an entire population.
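
As a minimal sketch of the idea above (using a small hypothetical `listings` tibble rather than the book's `airbnb` data), a population proportion is just the mean of a logical condition evaluated over every row:

```r
# Hypothetical data standing in for the full data set of listings
library(dplyr)

listings <- tibble(room_type = c("Entire home/apt", "Private room",
                                 "Entire home/apt", "Shared room"))

# Proportion of rows where the condition holds: the mean of a logical vector
listings |>
  summarize(population_proportion = mean(room_type == "Entire home/apt"))
```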
@@ -398,7 +398,7 @@ estimates
```

The average value of the sample of size 40
-is \$`r round(estimates$mean_price, 2)`. This
+is \$`r format(round(estimates$mean_price, 2), nsmall=2)`. This
number is a point estimate for the mean of the full population.
Recall that the population mean was
\$`r round(population_parameters$mean_price,2)`. So our estimate was fairly close to
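
The sampling-and-estimation step described above can be sketched in base R, with a simulated stand-in for the population of prices (the real data are not loaded here):

```r
# Simulated stand-in for the population of listing prices (not the real data)
set.seed(123)
population_prices <- rgamma(4594, shape = 2, scale = 75)

# Draw a single random sample of size 40 and compute the point estimate
one_sample_prices <- sample(population_prices, size = 40)
mean(one_sample_prices)  # point estimate of the population mean price
```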
@@ -771,7 +771,7 @@ and use a bootstrap distribution using just a single sample from the population.
Once again, suppose we are
interested in estimating the population mean price per night of all Airbnb
listings in Vancouver, Canada, using a single sample size of 40.
-Recall our point estimate was \$`r round(estimates$mean_price, 2)`. The
+Recall our point estimate was \$`r format(round(estimates$mean_price, 2), nsmall=2)`. The
histogram of prices in the sample is displayed in Figure \@ref(fig:11-bootstrapping1).

```{r, echo = F, message = F, warning = F}
@@ -791,7 +791,7 @@ one_sample_dist
```

The histogram for the sample is skewed, with a few observations out to the right. The
-mean of the sample is \$`r round(estimates$mean_price, 2)`.
+mean of the sample is \$`r format(round(estimates$mean_price, 2), nsmall=2)`.
Remember, in practice, we usually only have this one sample from the population. So
this sample and estimate are the only data we can work with.
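
Under the assumption that this one sample is all we have, the bootstrap amounts to resampling it with replacement. A base R sketch, with simulated prices standing in for the real sample of 40:

```r
# Simulated stand-in for the single observed sample of 40 prices
set.seed(123)
one_sample_prices <- rgamma(40, shape = 2, scale = 75)

# Resample with replacement many times; each resample's mean is one
# bootstrap estimate, and together they form the bootstrap distribution
boot_means <- replicate(1000,
                        mean(sample(one_sample_prices, size = 40,
                                    replace = TRUE)))
```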

@@ -1114,7 +1114,8 @@ To calculate a 95\% percentile bootstrap confidence interval, we will do the fol

\newpage

-To do this in R, we can use the `quantile()` function:
+To do this in R, we can use the `quantile()` function. Quantiles are expressed in proportions rather than
+percentages, so the 2.5th and 97.5th percentiles would be the 0.025 and 0.975 quantiles, respectively.
\index{quantile}
\index{pull}
\index{select}
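
For instance, on a vector of bootstrap means (simulated here for illustration), the 0.025 and 0.975 quantiles give the endpoints of a 95% percentile interval:

```r
# Simulated bootstrap means, standing in for the real bootstrap distribution
set.seed(123)
boot_means <- rnorm(1000, mean = 150, sd = 10)

# 2.5th and 97.5th percentiles = the 0.025 and 0.975 quantiles
quantile(boot_means, probs = c(0.025, 0.975))
```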
@@ -1149,9 +1150,9 @@ boot_est_dist +
To finish our estimation of the population parameter, we would report the point
estimate and our confidence interval's lower and upper bounds. Here the sample
mean price per night of 40 Airbnb listings was
-\$`r round(mean(one_sample$price),2)`, and we are 95\% "confident" that the true
+\$`r format(round(mean(one_sample$price),2), nsmall=2)`, and we are 95\% "confident" that the true
population mean price per night for all Airbnb listings in Vancouver is between
-\$(`r round(bounds[1],2)`, `r round(bounds[2],2)`).
+\$`r round(bounds[1],2)` and \$`r round(bounds[2],2)`.
Notice that our interval does indeed contain the true
population mean value, \$`r round(mean(airbnb$price),2)`\! However, in
practice, we would not know whether our interval captured the population
10 changes: 5 additions & 5 deletions source/regression1.Rmd
@@ -456,8 +456,8 @@ the model and returns the RMSPE for each number of neighbors. In the output of t
results data frame, we see that the `neighbors` variable contains the value of $K$,
the mean (`mean`) contains the value of the RMSPE estimated via cross-validation,
and the standard error (`std_err`) contains a value corresponding to a measure of how uncertain we are in the mean value. A detailed treatment of this
-is beyond the scope of this chapter; but roughly, if your estimated mean is 100,000 and standard
-error is 1,000, you can expect the *true* RMSPE to be somewhere roughly between 99,000 and 101,000 (although it may
+is beyond the scope of this chapter; but roughly, if your estimated mean RMSPE is \$100,000 and standard
+error is \$1,000, you can expect the *true* RMSPE to be somewhere roughly between \$99,000 and \$101,000 (although it may
fall outside this range). You may ignore the other columns in the metrics data frame,
as they do not provide any additional insight.
\index{cross-validation!collect\_metrics}
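
The rough interpretation in the paragraph above is just the mean plus or minus one standard error; with the numbers used there:

```r
# Rough uncertainty range for an estimated RMSPE: mean +/- one standard error
mean_rmspe <- 100000
std_err    <- 1000
c(lower = mean_rmspe - std_err, upper = mean_rmspe + std_err)
#>  lower  upper
#>  99000 101000
```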
@@ -763,9 +763,9 @@ predictor *as part of the model tuning process* (e.g., if we are running forward
in the chapter on evaluating and tuning classification models),
then we must compare the accuracy estimated using only the training data via cross-validation.
Looking back, the estimated cross-validation accuracy for the single-predictor
-model was `r format(round(sacr_min$mean), big.mark=",", nsmall=0, scientific = FALSE)`.
+model was \$`r format(round(sacr_min$mean), big.mark=",", nsmall=0, scientific = FALSE)`.
The estimated cross-validation accuracy for the multivariable model is
-`r format(round(sacr_multi$mean), big.mark=",", nsmall=0, scientific = FALSE)`.
+\$`r format(round(sacr_multi$mean), big.mark=",", nsmall=0, scientific = FALSE)`.
Thus in this case, we did not improve the model
by a large amount by adding this additional predictor.

@@ -797,7 +797,7 @@ knn_mult_mets

This time, when we performed KNN regression on the same data set, but also
included the number of bedrooms as a predictor, we obtained an RMSPE test error
-of `r format(round(knn_mult_mets |> pull(.estimate)), big.mark=",", nsmall=0, scientific=FALSE)`.
+of \$`r format(round(knn_mult_mets |> pull(.estimate)), big.mark=",", nsmall=0, scientific=FALSE)`.
Figure \@ref(fig:07-knn-mult-viz) visualizes the model's predictions overlaid on top of the data. This
time the predictions are a surface in 3D space, instead of a line in 2D space, as we have 2
predictors instead of 1.
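
RMSPE itself is straightforward to compute by hand: it is the square root of the mean squared difference between predicted and actual values on test data, which is why it carries the units of the response. A sketch with hypothetical sale prices:

```r
# Hypothetical predicted and actual sale prices on a small test set (USD)
actual    <- c(350000, 420000, 510000)
predicted <- c(365000, 400000, 540000)

# RMSPE: root mean squared prediction error, in the same units as the response
sqrt(mean((actual - predicted)^2))
```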
8 changes: 4 additions & 4 deletions source/regression2.Rmd
@@ -284,7 +284,7 @@ lm_test_results
```

Our final model's test error as assessed by RMSPE \index{RMSPE}
-is `r format(round(lm_test_results |> filter(.metric == 'rmse') |> pull(.estimate)), big.mark=",", nsmall=0, scientific=FALSE)`.
+is \$`r format(round(lm_test_results |> filter(.metric == 'rmse') |> pull(.estimate)), big.mark=",", nsmall=0, scientific=FALSE)`.
Remember that this is in units of the response variable, and here that
is US Dollars (USD). Does this mean our model is "good" at predicting house
sale price based off of the predictor of home size? Again, answering this is
@@ -504,7 +504,7 @@ lm_mult_test_results
```

Our model's test error as assessed by RMSPE
-is `r format(round(lm_mult_test_results |> filter(.metric == 'rmse') |> pull(.estimate)), big.mark=",", nsmall=0, scientific=FALSE)`.
+is \$`r format(round(lm_mult_test_results |> filter(.metric == 'rmse') |> pull(.estimate)), big.mark=",", nsmall=0, scientific=FALSE)`.
In the case of two predictors, we can see that the predictions made by our linear regression form a *plane* of best fit, as
shown in Figure \@ref(fig:08-3DlinReg).

@@ -614,12 +614,12 @@ lm_mult_test_results
```

We obtain an RMSPE \index{RMSPE} for the multivariable linear regression model
-of `r format(lm_mult_test_results |> filter(.metric == 'rmse') |> pull(.estimate), big.mark=",", nsmall=0, scientific = FALSE)`. This prediction error
+of \$`r format(lm_mult_test_results |> filter(.metric == 'rmse') |> pull(.estimate), big.mark=",", nsmall=0, scientific = FALSE)`. This prediction error
is less than the prediction error for the multivariable KNN regression model,
indicating that we should likely choose linear regression for predictions of
house sale price on this data set. Revisiting the simple linear regression model
with only a single predictor from earlier in this chapter, we see that the RMSPE for that model was
-`r format(lm_test_results |> filter(.metric == 'rmse') |> pull(.estimate), big.mark=",", nsmall=0, scientific = FALSE)`,
+\$`r format(lm_test_results |> filter(.metric == 'rmse') |> pull(.estimate), big.mark=",", nsmall=0, scientific = FALSE)`,
which is slightly higher than that of our more complex model. Our model with two predictors
provided a slightly better fit on test data than our model with just one.
As mentioned earlier, this is not always the case: sometimes including more
