This repository has been archived by the owner on Mar 21, 2021. It is now read-only.

LDSSA/batch2-capstone

Batch 2 - Capstone: Predicting serious injury

Brief

It's 2028, and you are on a new assignment with a new client called Insurance Inc.

The client is very excited to roll out a new health insurance product that offers ultra-low premiums on the simple condition that deductibles are subject to change at any time.

For Insurance Inc. to offer the lowest premiums possible, it needs to assess your risk in a traffic situation in real time so that your deductible can be adjusted. The data needed to do this is easily obtained through a smartphone app that customers download and install. The app transmits real-time information about the customer and their surroundings; this is the information found in your dataset.

This is where you come in. Using the data obtained from the smartphone app, you are going to provide a probability of serious injury for a given person in the moments before a traffic accident occurs.

Characteristics of the dataset

  • Each row in the dataset corresponds to a single person, involved in a car crash, moments before it occurs.
  • The column names and values are meant to be self-explanatory, so no data dictionary will be supplied.
  • The entries in y_train are: 1=serious_injury, and 0=no_serious_injury
    • This means that a prediction of 1 expresses absolute certainty that a person suffered a serious (and expensive) injury.

Exploring the dataset and modeling

You are expected to explore and understand the dataset, and train a predictive model that outputs the probability of serious injury of a person involved in a car crash. This is a very difficult prediction task using only the given data, so don't expect your scores to be super high.

You have already taken a look at many datasets. However, good practices are always good to remember:

  • Take a good overview of the dataset before anything else - numerical variables, categorical variables, anomalies, outliers, missing values, etc. Basically, perform an Exploratory Data Analysis (EDA).
  • Quickly create a baseline model - this will be your starting point from where you will improve (or not!).
  • Think beyond what you have in your hands - you cannot fully guarantee that the data your model will see in production has exactly the same characteristics as the training dataset. Plan for failure.
  • Pipelines are life savers. You probably remember them. Your initial EDA notebook generally ends up so messy that you need to create a clean one containing only the decisions (steps) you want in your pipeline.
  • Don't overcomplicate your final solution - the more complicated it is, the more problems it might have in production.
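As an illustration of the "quickly create a baseline model" advice above, here is a minimal sketch of the simplest possible baseline: always predict the share of serious injuries seen in training. The function names and the toy labels are invented for this example and are not part of the capstone material; any real model you build should at least beat this.

```python
def fit_base_rate(y_train):
    """Return the fraction of positive labels (1 = serious_injury)."""
    if not y_train:
        raise ValueError("y_train must not be empty")
    return sum(y_train) / len(y_train)


def predict_proba_baseline(base_rate, n_rows):
    """Predict the same probability for every incoming observation."""
    return [base_rate] * n_rows


# Toy labels, invented for illustration only (not the real y_train).
y_train = [0, 0, 1, 0, 1, 0, 0, 0]
base_rate = fit_base_rate(y_train)
baseline_predictions = predict_proba_baseline(base_rate, 3)
print(base_rate, baseline_predictions)
```

Because this baseline ignores every feature, it cannot break on unseen or dirty inputs, which also makes it a reasonable reference point when you evaluate more complicated models.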

Deploying and testing

Follow the guidelines for deploying a model on Heroku and use the test-server.py script to test your deployment. Start your server and then run python test-server.py. The script takes a few options; here are some examples of how to use it:

# This assumes that you have a file called X_train.csv in the data directory
# and you have started your server on your localhost because you are developing it
python test-server.py "data/X_train.csv" "http://127.0.0.1:5000/"

# This is the same scenario but with 100 observations instead of the default 10
python test-server.py "final_datasets/X_train.csv" "http://127.0.0.1:5000/" -n 100

# This will use a different random state to select the observations. You will use
# this after you have one set of observations working well and you want to test
# with a different set.
python test-server.py "final_datasets/X_train.csv" "http://127.0.0.1:5000/" -r 45

# now you've deployed it to heroku!
python test-server.py "final_datasets/X_train.csv" "https://deployed-model.herokuapp.com"

Remember!

You only have 10K rows on the free tier of Heroku, so after testing you will want to clear out your database to make sure that you aren't taking up any precious space when we actually start the simulator!

Report Guidelines

Here is a link to the report guidelines

Capstone Project Structure and Schedule

This capstone project is meant to individually test students for the following:

  • Their grasp of the model development workflow
  • Their ability to do basic EDA
  • Their ability to train and evaluate a predictive model
  • Their ability to professionally communicate their findings
  • Their ability to understand the paradigm of providing predictions on unseen data
    • Robustness to unseen data that could be quite dirty
    • Providing predictions over time rather than kaggle-style all at once
  • Their ability to make decisions about which technical tools to use, given the problem at hand.

This is quite different from the previous specializations because you will not be able to submit as many times.

This specialization will be the primary way in which we may certify you as entry-level data scientists, so it is very important that this capstone is both difficult and fair.

Another thing to note is that the EDA and model development portion of this project is not the primary focus of this specialization. You should already know how to do this. It should NOT take a long time to do EDA and train a model that performs acceptably.

Components

  • A single learning repository describing how to deploy a model to Heroku while saving observations to a database. An update for Windows users, or a requirement to use Windows Subsystem for Linux, is almost certainly required.
  • One binary classification dataset that is split into 3 parts (see image below for details)
  • An initial report that they must submit describing the EDA on the dataset and the model that they will deploy
  • A simulator that feeds the test set and some true outcomes to the students' deployed models over the course of a week or two.
  • A final report describing the test set and any updates to the model that they deployed

The dataset

This should be a binary classification dataset that we can logically split into the following parts:

With the following additional requirements:
  • X_test_1 and X_test_2 must contain both numerical and categorical values that X_train did not have
  • Whatever the model, its performance on X_test_2 must be demonstrably better after it is re-trained on X_test_1 and y_test_1
  • There must be noticeable shifts in the populations or distributions of 2 features that can be detected using statistical tests.
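The requirement above that distribution shifts be detectable with statistical tests can be sketched as follows for a numerical feature. This is one possible approach, not the test used by the instructors: a two-sample Kolmogorov-Smirnov statistic compared against the usual large-sample critical value. The function names and the toy "speed" values are invented for the example; for categorical features you would need a different test, such as a chi-squared test on category counts.

```python
import math
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Largest gap between the two empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    if not a or not b:
        raise ValueError("both samples must be non-empty")
    d = 0.0
    for x in set(a) | set(b):
        cdf_a = bisect_right(a, x) / len(a)  # share of a that is <= x
        cdf_b = bisect_right(b, x) / len(b)  # share of b that is <= x
        d = max(d, abs(cdf_a - cdf_b))
    return d


def ks_critical_value(n, m, alpha=0.05):
    """Large-sample critical value for the two-sample KS test."""
    c_alpha = math.sqrt(-0.5 * math.log(alpha / 2))
    return c_alpha * math.sqrt((n + m) / (n * m))


def has_shifted(train_values, new_values, alpha=0.05):
    """True if the KS statistic exceeds the critical value at level alpha."""
    d = ks_statistic(train_values, new_values)
    return d > ks_critical_value(len(train_values), len(new_values), alpha)


# Toy values for a hypothetical numerical feature, invented for illustration.
train_speed = list(range(100))
shifted_speed = [v + 50 for v in range(100)]
print(has_shifted(train_speed, train_speed), has_shifted(train_speed, shifted_speed))
```

The critical value is an asymptotic approximation, so treat the result with caution on small samples, and remember that running the check on many features at once increases the chance of a false alarm.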

The simulation

X_test_1 and y_test_1

For X_test_1, the flow will happen in this order for an example observation:

  1. Observation #1 arrives at the student server with all features needed to provide a prediction
  2. The student server will store the observation in your database
  3. The student server must return a prediction between 0 and 1 for observation #1
  4. Some time later, the true outcome for observation #1 will arrive at the student server
    • This true outcome is taken from y_test_1 for observation #1
  5. The student server will store the true outcome for observation #1 in the database
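The five steps above can be sketched without any web framework, assuming a local SQLite database as a stand-in for the database of your deployed app. The table layout, the function names, and the dummy model are all invented for this illustration; the request and response formats used by the real simulator are not shown here.

```python
import json
import sqlite3


def create_store(path=":memory:"):
    """Open the database and create the (hypothetical) observations table."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS observations ("
        "id INTEGER PRIMARY KEY, features TEXT NOT NULL, "
        "prediction REAL, true_outcome INTEGER)"
    )
    return conn


def handle_observation(conn, obs_id, features, predict):
    """Steps 1-3: store the raw observation, then predict and return."""
    # Store first, so the raw data survives even if the model fails.
    conn.execute(
        "INSERT INTO observations (id, features) VALUES (?, ?)",
        (obs_id, json.dumps(features)),
    )
    conn.commit()
    # Clip to [0, 1] so the response is always a valid probability.
    probability = min(1.0, max(0.0, float(predict(features))))
    conn.execute(
        "UPDATE observations SET prediction = ? WHERE id = ?",
        (probability, obs_id),
    )
    conn.commit()
    return probability


def handle_true_outcome(conn, obs_id, true_outcome):
    """Steps 4-5: attach the true outcome to a stored observation."""
    cursor = conn.execute(
        "UPDATE observations SET true_outcome = ? WHERE id = ?",
        (int(true_outcome), obs_id),
    )
    conn.commit()
    return cursor.rowcount == 1  # False if the observation was never stored


conn = create_store()
returned = handle_observation(conn, 1, {"age": 34}, lambda features: 0.25)
updated = handle_true_outcome(conn, 1, 1)
row = conn.execute(
    "SELECT prediction, true_outcome FROM observations WHERE id = 1"
).fetchone()
print(returned, updated, row)
```

Storing the observation before calling the model is a deliberate choice in this sketch: it keeps the raw input available for later analysis even when the prediction step raises an error.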

X_test_2

For X_test_2, the flow will happen in this order for an example observation:

  1. Observation #5000 arrives with all features needed to provide a prediction
  2. The student server will store the observation in your database
  3. The student server must return a prediction between 0 and 1 for observation #5000

Upshots

Notice that for X_test_1 and X_test_2 you are essentially expanding the training set. If the training set is expanded, it means that it is possible to train and deploy another model that has had the benefit of seeing more data. It will require some judgement about when and how often to do this though because it involves messing with production systems and the more one does this, the more chances there are for something to go wrong!

Also, it's quite important to make sure that the students are recording every piece of data that comes into the server. This is VERY important for the second report because they will need to analyze what kind of data showed up after they deployed the initial model and wrote the report for that.

The reports

Both reports should be professional quality. This means that in order to receive a passing grade, we should feel comfortable submitting the report to our boss or to a client. Developing guidelines for these reports that give clear guidance without handing out a cookie-cutter recipe is a non-trivial task that will require quite a bit of judgement. You should follow the guidelines.

Evaluation

God willing, by the time we finish, we will have 30-40 more students that have submitted all material that must be graded quickly and with quality. By quickly we mean that it cannot take too much time per report because of limited instructor hours. By quality we mean it must be harsh but fair since we cannot certify anyone that does not deserve it while at the same time making sure that there are no surprises for the students.

You should follow the report guidelines. You can find the grading components in that document.

Minimum requirements to count toward graduation

The capstone is one of the most important pieces of work that is turned in by the student. In order for the work to count toward the graduation requirements, the students must:

  • Cover all points required in the reports
    • Note that since the grading sheet that we will be using is given to you ahead of time, the requirement is 100% coverage of the topics.
  • Have a deployed server that returns predictions for at least 80% of the requests sent to it
    • This is nowhere near good enough for a real production system but since this is your first time at it, we'll be kind :)
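One way to protect your response rate against dirty or unexpected inputs is to wrap the model call so that a failure still yields a valid answer. The sketch below is an assumption about how you might do this, not a required design: safe_predict, the fallback value, and the failing model are all invented for the example.

```python
import math


def safe_predict(predict, features, fallback=0.5):
    """Return a probability in [0, 1], using a fallback if the model fails."""
    try:
        probability = float(predict(features))
    except Exception:
        # Any model error (unseen category, missing field, bad type)
        # falls back to a constant instead of failing the request.
        return fallback
    if math.isnan(probability):
        return fallback
    return min(1.0, max(0.0, probability))


def broken_model(features):
    """Hypothetical model that fails on an unseen category."""
    raise KeyError("unseen category")


print(safe_predict(broken_model, {"weather": "meteor shower"}))
print(safe_predict(lambda features: 1.7, {}))
```

A sensible fallback is the share of serious injuries in your training data rather than 0.5. Log every fallback as well, so that the failures show up when you analyze the incoming data for the second report.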

Schedule

Taken from this spreadsheet, here is a breakdown of what components of the capstone will be covered on which days:

With a work breakdown estimate of the following
