This project contains our steps to solving a data science predictive classification problem, which is the Tanzanian water well access limitations. Our main goal was to create a highly accurate predictive model to determine which water wells in Tanzania require repairs.
We obtained our data set from the Driven Data competition, originally sourced from the Tanzanian Ministry of Water and supplied by Taarifa. Our objective is to build a model that predict if the pump is functional or non functional or functional need repair. The data consist of 59,400 rows, 30 categorical columns and 10 numerical columns. we will be working with most of the data given and drop the redundant information.
The methodology we used for building the model are:
- Logistic Regression
- Knn nearest neighbor
- Decision Tree Classifier
- Random Forest Classifier
- CatBoost Classifier
Category Catboost Model predict an accuracy score of 80% on the holdout data. The model can be use to locate water wells that need repairs and determine the high risk area.
- Explore features and trends
- Identify improvement opportunity
- Optimize for maintenance cost by clustering
- Data
- Notebook
- WaterWellsProjectNotebook
- Presentation
- Readme
- Ely Lin
- Samantha Nasti
- Titilayo Amuwo