It's 2028, and you are on a new assignment with a new client called Insurance Inc.
The client is very excited to roll out a new health insurance product that offers ultra-low premiums on the simple condition that deductibles are subject to change at any time.
In order for Insurance Inc. to offer the lowest premiums possible, it needs to be able to assess a customer's risk in a traffic situation in real time so that their deductible can be adjusted. The data needed to do this is easily obtained through a smartphone app that customers download and install. The app transmits real-time information about the customer and their surroundings - this is the information found in your dataset.
This is where you come in. Using the data obtained from the smartphone app you are going to provide a probability of serious injury, for a given person in the moments before a traffic accident occurs.
- Each row in the dataset corresponds to a single person involved in a car crash, moments before it occurs.
- The column names and values are meant to be quite explicit - no data dictionary will be supplied.
- The entries in `y_train` are: `1 = serious_injury` and `0 = no_serious_injury`
- This means that a prediction of `1` is absolute certainty that a person suffered a serious (and expensive) injury, and a prediction of `0` is absolute certainty that they did not.
You are expected to explore and understand the dataset, and train a predictive model that outputs the probability of serious injury of a person involved in a car crash. This is a very difficult prediction task using only the given data, so don't expect your scores to be super high.
You have already taken a look at many datasets. However, good practices are always good to remember:
- Take a good overview of the dataset before anything else - numerical variables, categorical variables, anomalies, outliers, missing values, etc. Basically, perform an Exploratory Data Analysis (EDA).
- Quickly create a baseline model - this will be your starting point from where you will improve (or not!).
- Think beyond what you have in your hands - you cannot fully guarantee that the data your model will see in production has exactly the same characteristics as the training dataset. Plan for failure.
- Pipelines are life savers. You probably remember them. Your initial EDA notebook usually ends up so messy that you'll want to create a second one containing only the decisions (steps) you would like to have in your pipeline (see the sketch after this list).
- Don't overcomplicate your final solution - the more complicated it is, the more problems it might have in production.
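To make the pipeline point concrete, here is a minimal baseline sketch. It assumes the training files live under `data/` (adjust the paths to your own layout) and uses a simple scikit-learn pipeline; it is a starting point, not the required solution.

```python
# A minimal baseline sketch: impute, encode, scale, and fit a simple
# classifier, then output a probability of serious injury per row.
# File paths are assumptions about your local layout.
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

X_train = pd.read_csv("data/X_train.csv")
y_train = pd.read_csv("data/y_train.csv").squeeze()

num_cols = X_train.select_dtypes(include="number").columns
cat_cols = X_train.columns.difference(num_cols)

preprocess = ColumnTransformer([
    ("num", Pipeline([("impute", SimpleImputer(strategy="median")),
                      ("scale", StandardScaler())]), num_cols),
    ("cat", Pipeline([("impute", SimpleImputer(strategy="most_frequent")),
                      ("encode", OneHotEncoder(handle_unknown="ignore"))]), cat_cols),
])

baseline = Pipeline([("preprocess", preprocess),
                     ("clf", LogisticRegression(max_iter=1000))])
baseline.fit(X_train, y_train)

# Probability of serious injury (class 1) for each observation
probs = baseline.predict_proba(X_train)[:, 1]
```

Setting `handle_unknown="ignore"` on the encoder is one small way of planning for failure, since the data your model sees in production may contain categories the training set never had.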
Follow the guidelines for deploying a model on Heroku and use the `test-server.py` script to test your deployment. What you'll want to do is start your server and then run `python test-server.py`. There are a few options to use with it. Here are a few examples of how to use the script:
# This assumes that you have a file called X_train.csv in the data directory
# and you have started your server on your localhost because you are developing it
python test-server.py "data/X_train.csv" "http://127.0.0.1:5000/"
# This is the same scenario but with 100 observations instead of the default 10
python test-server.py "final_datasets/X_train.csv" "http://127.0.0.1:5000/" -n 100
# This will use a different random state to select the observations. You will use
# this after you have one set of observations working well and you want to test
# with a different set.
python test-server.py "final_datasets/X_train.csv" "http://127.0.0.1:5000/" -r 45
# now you've deployed it to heroku!
python test-server.py "final_datasets/X_train.csv" "https://deployed-model.herokuapp.com"
You only have 10K rows on the free tier of Heroku, so after testing you will want to clear out your database to make sure that you aren't taking up any precious space for when we actually start the simulator!
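If you are using the Heroku Postgres add-on, clearing out the test rows can be as simple as truncating the table(s) you created. The sketch below assumes the `DATABASE_URL` config var that Heroku sets and a hypothetical `observations` table; adapt it to your own schema.

```python
# A hedged sketch of clearing stored test rows, assuming Heroku Postgres
# (DATABASE_URL) and a hypothetical "observations" table.
import os

import psycopg2

with psycopg2.connect(os.environ["DATABASE_URL"]) as conn, conn.cursor() as cur:
    cur.execute("TRUNCATE TABLE observations;")  # remove every stored test row
```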
Here is a link to the report guidelines
This capstone project is meant to individually test students for the following:
- Their grasp of the model development workflow
- Their ability to do basic EDA
- Their ability to train and evaluate a predictive model
- Their ability to professionally communicate their findings
- This will be done via professional reports that are more important than the trained model itself
- Their ability to understand the paradigm of providing predictions on unseen data
- Robustness to unseen data that could be quite dirty
- Providing predictions over time rather than kaggle-style all at once
- Their ability to make decisions on which technical tools to use considering the problem at hand.
This is quite different from the previous specializations because you will not be able to submit as many times.
This specialization will be the primary way in which we may certify you as entry-level data scientists, so it is very important that this capstone is both difficult and fair.
Another thing to note is that the EDA and model development portion of this project is not the primary focus of this specialization. You should already know how to do this. It should NOT take a long time to do EDA and train a model that performs acceptably.
- A single learning repository describing how to deploy a model to Heroku while saving observations to a database. An update for Windows users or a requirement to use Windows Subsystem for Linux is almost certainly required.
- One binary classification dataset that is split into 3 parts (see image below for details)
- An initial report that they must submit describing the EDA on the dataset and the model that they will deploy
- A simulator that feeds the test set and some true outcomes to the students' deployed models over the course of a week or two.
- A final report describing the test set and any updates to the model that they deployed
This should be a binary classification dataset that we can logically split into the following parts:
With the following additional requirements:
- `X_test_1` and `X_test_2` must contain both numerical and categorical values that `X_train` did not have
- Model performance on `X_test_2` must be demonstrably better, regardless of the model, if re-trained on `X_test_1` and `y_test_1`
- There must be noticeable shifts in the populations or distributions of 2 features that can be detected using statistical tests.
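For illustration, here is a hedged sketch of how such shifts could be detected with standard tests; the feature names (`age`, `vehicle_type`) and the file of incoming observations are purely hypothetical, so substitute whatever features you actually want to monitor.

```python
# A hedged sketch of checking for distribution shift between the training data
# and incoming observations. Column names and file paths are hypothetical.
import pandas as pd
from scipy.stats import chi2_contingency, ks_2samp

train = pd.read_csv("data/X_train.csv")
incoming = pd.read_csv("data/incoming_observations.csv")  # e.g. rows exported from your database

# Numerical feature: two-sample Kolmogorov-Smirnov test on the raw values
ks_stat, p_num = ks_2samp(train["age"].dropna(), incoming["age"].dropna())

# Categorical feature: chi-squared test on the contingency table of value counts
counts = pd.concat(
    [train["vehicle_type"].value_counts(), incoming["vehicle_type"].value_counts()],
    axis=1,
).fillna(0)
chi2, p_cat, _, _ = chi2_contingency(counts.T)

# Small p-values suggest the incoming population differs from the training data
print(f"KS p-value (age): {p_num:.4f} | chi2 p-value (vehicle_type): {p_cat:.4f}")
```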
For `X_test_1`, the flow will happen in this order for an example observation:
- Observation #1 arrives at the student server with all features needed to provide a prediction
- The student server will store the observation in the database
- The student server must return a prediction between 0 and 1 for observation #1
- Some time later, the true outcome for observation #1 will arrive at the student server
  - This true outcome is taken from `y_test_1` for observation #1
- The student server will store the true outcome for observation #1 in the database
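As a rough illustration of that flow, here is a minimal sketch assuming Flask, a pipeline saved with joblib, and a Heroku Postgres database reachable via `DATABASE_URL`. The endpoint names, payload fields (`id`, `observation`, `true_outcome`), and the `observations` table are assumptions made for illustration; follow the learning repository for the actual interface and storage setup.

```python
# A minimal sketch of the prediction/outcome flow. Endpoint names, payload
# fields, and the "observations" table (assumed to already exist) are
# illustrative assumptions, not the required interface.
import json
import os

import joblib
import pandas as pd
import psycopg2
from flask import Flask, jsonify, request

app = Flask(__name__)
model = joblib.load("pipeline.joblib")  # the trained pipeline, path assumed


def get_conn():
    return psycopg2.connect(os.environ["DATABASE_URL"])


@app.route("/predict", methods=["POST"])
def predict():
    payload = request.get_json()                    # observation with all features
    features = payload["observation"]
    proba = float(model.predict_proba(pd.DataFrame([features]))[:, 1][0])
    with get_conn() as conn, conn.cursor() as cur:  # store the observation and prediction
        cur.execute(
            "INSERT INTO observations (id, features, proba) VALUES (%s, %s, %s)",
            (payload["id"], json.dumps(features), proba),
        )
    return jsonify({"proba": proba})                # prediction between 0 and 1


@app.route("/update", methods=["POST"])
def update():
    payload = request.get_json()                    # true outcome arriving later
    with get_conn() as conn, conn.cursor() as cur:  # store the true outcome
        cur.execute(
            "UPDATE observations SET true_outcome = %s WHERE id = %s",
            (payload["true_outcome"], payload["id"]),
        )
    return jsonify({"status": "stored"})
```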
For `X_test_2`, the flow will happen in this order for an example observation:
- Observation #5000 arrives with all features needed to provide a prediction
- The student server will store the observation in the database
- The student server must return a prediction between 0 and 1 for observation #5000
Notice that for `X_test_1` and `X_test_2` you are essentially expanding the training set. If the training set is expanded, it means that it is possible to train and deploy another model that has had the benefit of seeing more data. It will require some judgement about when and how often to do this though, because it involves messing with production systems, and the more one does this, the more chances there are for something to go wrong!
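If you do decide to retrain on the expanded data, the mechanics might look something like the hedged sketch below: pull the stored observations that already have a true outcome, append them to the original training set, and refit the deployed pipeline. It assumes the same hypothetical `observations` table as above and that the original training CSVs are still available locally.

```python
# A hedged sketch of retraining on the expanded training set. Table, column,
# and file names are assumptions about your own setup.
import json
import os

import joblib
import pandas as pd
import psycopg2

with psycopg2.connect(os.environ["DATABASE_URL"]) as conn, conn.cursor() as cur:
    cur.execute(
        "SELECT features, true_outcome FROM observations WHERE true_outcome IS NOT NULL"
    )
    rows = cur.fetchall()

X_new = pd.DataFrame([json.loads(features) for features, _ in rows])
y_new = pd.Series([outcome for _, outcome in rows])

X_train = pd.read_csv("data/X_train.csv")
y_train = pd.read_csv("data/y_train.csv").squeeze()

X_full = pd.concat([X_train, X_new], ignore_index=True)
y_full = pd.concat([y_train, y_new], ignore_index=True)

pipeline = joblib.load("pipeline.joblib")     # the currently deployed pipeline
pipeline.fit(X_full, y_full)                  # refit with the extra labelled rows
joblib.dump(pipeline, "pipeline_v2.joblib")   # deploy deliberately, not on every new row
```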
Also, it's quite important to make sure that the students are recording every piece of data that comes into the server. This is VERY important for the second report because they will need to analyze what kind of data showed up after they deployed the initial model and wrote the report for that.
Both reports should be professional quality. This means that in order to receive a passing grade, we should feel comfortable submitting the report to our boss or to a client. Developing guidelines for these reports that give clear guidance but don't just hand out a cookie-cutter recipe is a non-trivial task that will require quite a bit of judgement. You should follow the guidelines.
God willing, by the time we finish, we will have 30-40 more students who have submitted all material, which must be graded quickly and with quality. By quickly we mean that it cannot take too much time per report because of limited instructor hours. By quality we mean it must be harsh but fair, since we cannot certify anyone who does not deserve it, while at the same time making sure that there are no surprises for the students.
You should follow the report guidelines. You can find the grading components in that document.
The capstone is one of the single most important pieces of work that is turned in by the student. In order for the work to count toward the graduation requirements, the students must
- Cover all points required in the reports
- Note that since the grading sheet that we will be using is given to you ahead of time, the requirement is 100% coverage of the topics.
- Your deployed server must return predictions for at least 80% of the requests sent to it
- This is nowhere near good enough for a real production system but since this is your first time at it, we'll be kind :)
Taken from this spreadsheet, here is a breakdown of what components of the capstone will be covered on which days:
With a work breakdown estimate of the following: