🚰 Drinking Water Potability Report 🚰

Jérôme Auguste - Marco Boucas - Ariane Dalens

Project for the Machine Learning course @CentraleSupélec

Based on the Kaggle Challenge Drinking_Water_Potability

Context

Access to safe drinking water is essential to health, a basic human right, and a component of effective policy for health protection. This is important as a health and development issue at a national, regional, and local level. In some regions, it has been shown that investments in water supply and sanitation can yield a net economic benefit, since the reductions in adverse health effects and health care costs outweigh the costs of undertaking the interventions.

Dataset

The dataset is available at this link: drinking_water_potability.csv. The file contains water quality metrics for 3276 different water bodies. The easiest way to run the notebooks with this dataset is to unzip the CSV file into the root folder of the repo.

Feature	Type	Count	Mean	Min	Max
ph	float	2785	7.08	0	14.00
Hardness	float	3276	196.37	47.43	323.12
Solids	float	3276	22014.09	320.94	61227.20
Chloramines	float	3276	7.12	0.35	13.13
Sulfate	float	2495	333.78	129.00	481.03
Conductivity	float	3276	426.21	181.48	753.34
Organic_carbon	float	3276	14.28	2.20	28.30
Trihalomethanes	float	3114	66.40	0.74	124.00
Turbidity	float	3276	3.97	1.45	6.74
Potability (target)	categorical	3276	39%	0	1
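
The counts in the table show that ph, Sulfate and Trihalomethanes have missing values. As a minimal sketch of how these counts could be reproduced, assuming pandas is installed and the CSV sits in the repo root as described above, one might write the following. The helper name `summarize_missing` is hypothetical and is not taken from the project's utils.py.

```python
from pathlib import Path

import pandas as pd

# Assumed location: the CSV unzipped into the repository root.
CSV_PATH = Path("drinking_water_potability.csv")


def summarize_missing(df: pd.DataFrame) -> pd.DataFrame:
    """Return non-null count, missing count and missing share (%) per column.

    Hypothetical helper for illustration only.
    """
    summary = pd.DataFrame(
        {
            "count": df.notna().sum(),
            "missing": df.isna().sum(),
        }
    )
    summary["missing_pct"] = (100 * summary["missing"] / len(df)).round(2)
    return summary


# Only runs when the file is actually present, so the sketch is safe to import.
if CSV_PATH.exists():
    water = pd.read_csv(CSV_PATH)
    print(summarize_missing(water))
```

On the real file, the `count` column should match the Count column of the table above.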

📦 Structure of the project

First, we explored and visualised the data to study correlations, missing values and outliers. This work is in the notebook Data_exploration_and_visualization.ipynb.
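
To illustrate the kind of checks this exploration involves, here is a minimal sketch that computes a correlation matrix and counts outliers per column. The 1.5 × IQR rule used here is a common convention and an assumption on our part, not necessarily the criterion used in the notebook. The function name `iqr_outlier_counts` and the demo values are hypothetical.

```python
import pandas as pd


def iqr_outlier_counts(df: pd.DataFrame, k: float = 1.5) -> pd.Series:
    """Count values outside [Q1 - k*IQR, Q3 + k*IQR] for each numeric column.

    Missing values are never counted as outliers, because comparisons
    against NaN evaluate to False.
    """
    numeric = df.select_dtypes(include="number")
    q1 = numeric.quantile(0.25)
    q3 = numeric.quantile(0.75)
    iqr = q3 - q1
    mask = (numeric < q1 - k * iqr) | (numeric > q3 + k * iqr)
    return mask.sum()


# Tiny made-up frame reusing two column names from the dataset.
demo = pd.DataFrame(
    {
        "ph": [6.8, 7.1, 7.0, 6.9, 7.2, 13.9],
        "Turbidity": [3.5, 4.0, 4.2, 3.9, None, 4.1],
    }
)

print(demo.corr())               # pairwise Pearson correlations, NaN-aware
print(iqr_outlier_counts(demo))  # outliers per column under the IQR rule
```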

After preprocessing the data, we compared and fine-tuned different models in the notebook Pipeline.ipynb, which led us to our final model.

🏆 Winning Model for this dataset

After researching which models are most relevant for this problem, we selected three well-performing algorithms (K-NN, SVM and Random Forest). Each one was fine-tuned, and they work together in a voting classifier.
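
As a rough sketch of how such an ensemble can be assembled with scikit-learn, the example below combines the three model families in a `VotingClassifier`. Everything specific here is an assumption for illustration: the synthetic data stands in for the real CSV, the hyperparameters are placeholders rather than the tuned values from Pipeline.ipynb, and hard (majority) voting and median imputation are simply reasonable defaults, not confirmed choices of the project.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in with 9 features, like the dataset; not the real data.
X, y = make_classification(
    n_samples=400, n_features=9, n_informative=5, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Placeholder hyperparameters; the tuned values are not given in this README.
voter = VotingClassifier(
    estimators=[
        ("knn", KNeighborsClassifier(n_neighbors=15)),
        ("svm", SVC(kernel="rbf", C=1.0)),
        ("rf", RandomForestClassifier(n_estimators=200, random_state=0)),
    ],
    voting="hard",  # majority vote over the three predicted labels
)

# Impute and scale first: K-NN and SVM are sensitive to feature scales,
# and the real dataset contains missing values.
model = make_pipeline(SimpleImputer(strategy="median"), StandardScaler(), voter)
model.fit(X_train, y_train)

predictions = model.predict(X_test)
accuracy = model.score(X_test, y_test)
print(f"Hold-out accuracy on synthetic data: {accuracy:.3f}")
```

Placing the imputer and scaler inside the pipeline means they are fitted on the training split only, which avoids leaking test-set statistics into the model.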

