This is my final capstone project for Thinkful's Data Science Flex Program. For this capstone we were asked to come up with a data science product, build a model for it, and present the results. As someone with high cholesterol, I decided to use regression on NHANES data to predict total cholesterol level. I felt health apps would benefit from being able to estimate a person's cholesterol level without the need for lab tests. If total cholesterol is predicted to be above 200, the health app can then ask the person whether they have had their cholesterol checked recently. I plan to redo this project to predict bad cholesterol instead of total cholesterol.
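The prompt rule described above can be sketched in a few lines of Python. This is only an illustration, not code from the notebook: the function name `should_ask_about_checkup`, the user IDs, and the predicted values are all made up, and the only thing taken from the project is the cutoff of 200.

```python
# Cutoff taken from the project description: predicted total cholesterol above 200.
HIGH_TOTAL_CHOLESTEROL = 200.0


def should_ask_about_checkup(predicted_total_cholesterol, threshold=HIGH_TOTAL_CHOLESTEROL):
    """Return True when the predicted value is strictly above the threshold.

    Hypothetical helper: the name and signature are assumptions for this sketch.
    """
    return predicted_total_cholesterol > threshold


# Invented predictions, standing in for the output of a regression model.
predictions = {"user_a": 185.2, "user_b": 214.7}

# Collect the users the app would ask about a recent cholesterol check.
to_prompt = [user for user, value in predictions.items() if should_ask_about_checkup(value)]
print(to_prompt)
```

In a real app the dictionary of predictions would come from whichever model is deployed; the comparison itself stays the same.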
- CapstoneResearchProposal.pdf -- initial research proposal
- presentation.ppxt -- PowerPoint presentation, needs to be updated
- files.csv -- the list of files used in the project
- final.ipynb -- the project notebook
- links.html -- just some references I wanted to keep while working on the project
- variables.csv -- the initial variable list, used when creating the dataset
- variables_final.csv -- the list of variables used in modeling
- When exploring the data, I discovered that over half of those with a total cholesterol level over 200 have never been told by a doctor they have high cholesterol. This reinforces the need for a model to estimate cholesterol levels.
- All of the models struggled to predict extreme values. Because the distribution of total cholesterol had outliers, a tool like SMOGN (an oversampling technique for imbalanced regression targets) might have helped.
- As the dataset contained over 100 files and over 1500 variables, I chose to use variables that could affect cholesterol level, based on my own online research. It would be interesting to explore using other variables.
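The first finding above comes down to a simple share: among people whose total cholesterol is over 200, what fraction were never told by a doctor that it is high? Below is a minimal sketch of that calculation in plain Python. The field names `total_cholesterol` and `told_high_by_doctor` are hypothetical stand-ins for the NHANES variables, and the records are invented toy values, not results from the project.

```python
def share_never_told(records, threshold=200.0):
    """Share of people above the threshold who were never told they have high cholesterol.

    Returns None when nobody is above the threshold.
    """
    high = [r for r in records if r["total_cholesterol"] > threshold]
    if not high:
        return None
    never_told = sum(1 for r in high if not r["told_high_by_doctor"])
    return never_told / len(high)


# Toy records, invented for illustration only (not NHANES data).
toy_records = [
    {"total_cholesterol": 230.0, "told_high_by_doctor": False},
    {"total_cholesterol": 210.0, "told_high_by_doctor": True},
    {"total_cholesterol": 250.0, "told_high_by_doctor": False},
    {"total_cholesterol": 180.0, "told_high_by_doctor": False},
]

share = share_never_told(toy_records)
print(round(share, 2))  # 2 of the 3 people above 200 were never told -> 0.67
```

The same logic carries over to a Pandas DataFrame by filtering on the cholesterol column and taking the mean of the negated "told" column.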
- Pandas
- Matplotlib
- Scikit-learn
- Keras
I am planning to redo this project using bad cholesterol instead, so I am not listing future tasks.