- See the Statistical Inference & Hypothesis Testing slides
- Review the statistical inference Rmarkdown file (preview the output here)
- Interactive demos from the slides:
- Read Chapter 7 of Introduction to Statistical Thinking (With R, Without Calculus) (IST) for a recap of sampling distributions. Feel free to execute code in the book along the way.
- Do question 7.1
- Read Chapter 9 of IST
- Do questions 9.1 and 9.2
- Go through the sampling means Rmarkdown file (preview the output here), and complete the last exercise
- Read Chapters 10 and 11 of IST
- For background:
  - Chapter 4 has a good review of population distributions, expectations, and variance
  - Chapter 5 has a recap of random variables
  - Chapter 6 has more information on the normal distribution
- See section 4 of Mindless Statistics and this article for some warnings on misinterpretations of p-values
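
As a quick recap of the sampling distribution idea from Chapter 7 of IST, here is a minimal simulation sketch. The exponential population, the sample size, and the number of repetitions are arbitrary choices made for illustration, not values taken from the book or the slides.

```r
set.seed(42)  # arbitrary seed, only so the run is reproducible

# Hypothetical population: exponential with rate 1 (mean 1, standard deviation 1)
n <- 50          # size of each sample
n_reps <- 5000   # number of repeated samples

# Draw n_reps samples of size n and record the mean of each one
sample_means <- replicate(n_reps, mean(rexp(n, rate = 1)))

# The spread of the sample means should be close to sd / sqrt(n)
simulated_se <- sd(sample_means)
theoretical_se <- 1 / sqrt(n)

cat("simulated SE:  ", round(simulated_se, 3), "\n")
cat("theoretical SE:", round(theoretical_se, 3), "\n")

# The histogram is roughly bell-shaped even though the population is skewed
hist(sample_means, breaks = 40,
     main = "Sampling distribution of the mean", xlab = "Sample mean")
```

Changing `n` and re-running shows how the standard error shrinks as the sample size grows.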
- Review the Prediction and Regression slides
- Do HW2 where you'll learn all about regression and Orange Juice!
- See this notebook on linear models with the `modelr` package from the tidyverse and this one on model evaluation
- Read Chapter 18 of R for Data Science on modeling in R
- Reference:
  - A description of the oj data
  - Formula syntax in R
  - Dan's interactive Visual Least Squares tool
  - Some background on elasticity: blog post, Khan Academy video
  - A slide deck on log transformations in regression
  - Chapter 3 of Introduction to Statistical Learning on regression
  - Also covered in Chapter 14 of Introduction to Statistical Thinking
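
To connect the elasticity and log-transformation references, here is a small sketch of a log-log regression, where the slope on log price can be read as a price elasticity. It uses synthetic data with hypothetical column names (`price`, `sales`) and an assumed true elasticity of -2; it does not use the real oj data or its column names.

```r
set.seed(1)  # arbitrary seed for reproducibility

# Synthetic data: hypothetical columns, NOT the real oj data
n <- 500
true_elasticity <- -2   # assumed value used to generate the data
prices <- runif(n, min = 1, max = 4)
sales <- exp(10 + true_elasticity * log(prices) + rnorm(n, sd = 0.3))
toy <- data.frame(price = prices, sales = sales)

# In a log-log model the slope is the elasticity: a 1% increase in price
# goes with roughly a (slope)% change in sales
fit <- lm(log(sales) ~ log(price), data = toy)
estimated_elasticity <- unname(coef(fit)["log(price)"])
print(estimated_elasticity)
```

The estimate should land near the value used to generate the data, which is the check to run before trusting the same model form on real sales.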
- Review the Testing, cross-validation, and model selection slides
- Do HW3, which looks at including store demographics and previous prices for modeling oj sales
- Investigate cross-price elasticity of oj sales together in class
- Review the slides on causality
- Do the assignment below
In this assignment we'll predict the number of trips per day as a function of the weather on that day. Do all of your work in an RMarkdown file named `citibike_cv.Rmd`.
- Create a data frame with one row for each day, the number of trips taken on that day, and the minimum temperature on that day.
- Split the data into a randomly selected training and test set, as in the above exercise, with 80% of the data for training the model and 20% for testing.
- Fit a model using `lm` to predict the number of trips as a (linear) function of the minimum temperature, and evaluate the fit on the training and testing data sets. Do this first visually by plotting the predicted and actual values as a function of the minimum temperature. Then do this with R^2 and RMSE on both the training and test sets. You'll want to use the `predict` and `cor` functions for this.
- Repeat this procedure, but add a quadratic term to your model (e.g., `+ I(tmin^2)`, or (more or less) equivalently `+ poly(tmin, 2)`). Note that a bare `tmin^2` in a formula is not treated as squaring, so wrap it in `I()`. How does the model change, and how do the fits between the linear and quadratic models compare?
- Now automate this, extending the model to higher-order polynomials with a `for` loop over the degree `k`. For each value of `k`, fit a model to the training data and save the R^2 on the training data to one vector and the R^2 on the test data to another. Then plot the training and test R^2 as a function of `k`. What value of `k` has the best performance?
- Finally, fit one model for the value of `k` with the best performance in the previous step, and plot the actual and predicted values for this model.
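
The overall shape of the train/test workflow above can be sketched on made-up data. This is only an illustration under stated assumptions: the data frame `days`, its columns `tmin` and `num_trips`, the helper functions `r_squared` and `rmse`, and the maximum degree of 8 are all hypothetical, and the numbers are simulated rather than real Citibike trips.

```r
set.seed(7)  # arbitrary seed for reproducibility

# Synthetic stand-in for the daily data; column names are assumptions
n <- 365
days <- data.frame(tmin = runif(n, min = 10, max = 80))
days$num_trips <- 5000 + 600 * days$tmin - 3 * days$tmin^2 + rnorm(n, sd = 3000)

# Random 80/20 train/test split
train_idx <- sample(n, size = floor(0.8 * n))
train <- days[train_idx, ]
test <- days[-train_idx, ]

# R^2 as the squared correlation of actual and predicted; RMSE in units of trips
r_squared <- function(actual, predicted) cor(actual, predicted)^2
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

max_degree <- 8
train_r2 <- numeric(max_degree)
test_r2 <- numeric(max_degree)

for (k in 1:max_degree) {
  # Fit on the training data only, then score on both sets
  fit <- lm(num_trips ~ poly(tmin, k), data = train)
  train_r2[k] <- r_squared(train$num_trips, predict(fit, train))
  test_r2[k] <- r_squared(test$num_trips, predict(fit, test))
}

# Degree with the highest test R^2 in this particular split
best_k <- which.max(test_r2)

plot(1:max_degree, train_r2, type = "b", col = "blue",
     ylim = range(c(train_r2, test_r2)),
     xlab = "Polynomial degree k", ylab = "R^2")
lines(1:max_degree, test_r2, type = "b", col = "red")
legend("bottomright", legend = c("train", "test"),
       col = c("blue", "red"), lty = 1, pch = 1)
```

Training R^2 can only go up as `k` grows because the models are nested, so the test curve is the one to read when picking a degree. The `rmse` helper can be applied to the same `predict` output to report error in units of trips.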
- Review these notebooks on linear models with the `modelr` package from the tidyverse and this one on model evaluation
- See this manual model fitting shiny app
- Do the assignment below
The point of this exercise is to get experience in an open-ended prediction exercise: predicting the total number of Citibike trips taken on a given day. Do all of your work in an RMarkdown file named `predict_citibike.Rmd`. Here are the rules of the game:
- You can use any features you like that are available prior to the day in question, ranging from the weather, to the time of year and day of week, to activity in previous days or weeks, but don't cheat and use features from the future (e.g., the next day's trips). You might even try finding a CSV of holidays online and adding a factor for "is_holiday" to your model to see if this improves the fit.
- As usual, split your data into training and testing subsets and evaluate performance on each.
- Quantify your performance in two ways: R^2 (or the square of the correlation coefficient), as we've been doing, and with root mean-squared error.
- Report the model with the best performance on the test data. Watch out for overfitting.
- Plot your final best fit model in two different ways. First with the date on the x-axis and the number of trips on the y-axis, showing the actual values as points and predicted values as a line. Second as a plot where the x-axis is the predicted value and the y-axis is the actual value, with each point representing one day.
- Inspect the model when you're done to figure out what the highly predictive features are, and see if you can prune away any negligible features.
- When you're convinced that you have your best model, clean up all your code so that it saves your best model in a `.RData` file.
- Commit all of your changes to git, using `git add -f` to add the model `.Rdata` file if needed, and push to your GitHub repository.
- Write a new file that loads in the weather data for new days and your saved model, and predicts the number of trips for each day (see load_trips.R for code snippets to load in the weather data).
- Modify the download_trips.sh script to download trips from 2015 (instead of 2014).
- Compute the RMSE between the actual and predicted trips for 2015 and compare the results to what you found with cross-validation.
- Pair up with a partner who has a different model, run their model, and evaluate the predictions it makes for the 2015 data.
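
The save, load, and evaluate steps above could look roughly like the following sketch. Everything specific here is an assumption for illustration: the data is simulated, the column names `tmin` and `num_trips` and the helper `make_days` are hypothetical, and the model is written to a temporary file rather than to your repository.

```r
set.seed(3)  # arbitrary seed for reproducibility

# Simulated data; the column names and the helper are hypothetical
make_days <- function(n) {
  tmin <- runif(n, min = 10, max = 80)
  data.frame(tmin = tmin,
             num_trips = 5000 + 400 * tmin + rnorm(n, sd = 2000))
}
old_days <- make_days(300)  # stands in for the data used to fit the model
new_days <- make_days(100)  # stands in for a later year of data

model <- lm(num_trips ~ tmin, data = old_days)

# Save the fitted model; the file name and location are assumptions
model_file <- file.path(tempdir(), "model.RData")
save(model, file = model_file)

# In a separate script, load() restores the object under its original name
rm(model)
load(model_file)

# Predict for the new days and compute RMSE against the actual values
predicted <- predict(model, newdata = new_days)
rmse <- sqrt(mean((new_days$num_trips - predicted)^2))
print(rmse)
```

Because `load` restores objects by name, it helps to agree with your partner on the object name stored in the file before swapping models.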