The main goal of the project is to analyse existing datasets and find patterns that identify the factors contributing to deaths from air pollution and traffic accidents in the US, and to propose ideas for achieving SDG 3.
The project aligns with the United Nations Sustainable Development Goal 3 (SDG 3), which aims to ensure healthy lives and promote well-being for all at all ages. We focus mainly on the following SDG 3 targets:
- By 2030, halve the number of global deaths and injuries from road traffic accidents.
- By 2030, substantially reduce the number of deaths and illnesses from hazardous chemicals and air, water and soil pollution and contamination.
- Pollution dataset: EPA - 1.2 GB
- Fatal traffic accidents: FARS - 3.83 GB
- Accident dataset: Kaggle - 1 GB
- Chronic Respiratory Diseases Mortality Data: GHDx - 700 MB
- Mortality dataset: CDC - 4 GB
The methods described below were run on Google Cloud Platform. We used a standard cluster with 1 master node (e2-standard, 2 cores, 32 GB) and 3 worker nodes (e2-standard, 4 cores, 64 GB), which primarily runs HDFS and Spark on YARN in cluster mode.
- Air Pollution Analysis Pipeline
- Accidents Analysis Pipeline
Air pollutants vs. mortality rates for San Francisco County:
More results can be found here
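One simple way to quantify a pollutant-versus-mortality comparison like the one above is a Pearson correlation between the two yearly series. The following is a minimal sketch, not the project's actual analysis code: the function name `pearson`, the variable names, and the sample values are all hypothetical placeholders and are not taken from the EPA or CDC datasets.

```python
from math import sqrt


def pearson(xs, ys):
    """Return the Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of products of deviations (unnormalised covariance).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (unnormalised variances).
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / sqrt(var_x * var_y)


# Placeholder values for illustration only, NOT real measurements.
yearly_pollutant_mean = [10.0, 12.0, 14.0, 16.0]
yearly_mortality_rate = [20.0, 24.0, 28.0, 32.0]
r = pearson(yearly_pollutant_mean, yearly_mortality_rate)
```

A coefficient near 1 or -1 indicates a strong linear association between the two series, but it does not by itself establish that the pollutant causes the change in mortality.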
Forecasting using LSTM for LA County:
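Before a time series can be fed to an LSTM, it is commonly cut into fixed-length input windows, each paired with the value to be predicted. The sketch below shows only that preparation step, assuming a plain list of readings; the helper name `make_windows` and its parameters `window` and `horizon` are hypothetical and may differ from what the project notebooks use.

```python
def make_windows(series, window, horizon=1):
    """Split a series into (input window, target) pairs for supervised forecasting.

    window  -- number of consecutive past values in each input
    horizon -- how many steps after the window the target lies (1 = next value)
    """
    if window < 1 or horizon < 1:
        raise ValueError("window and horizon must be positive")
    pairs = []
    # The last usable start index leaves room for the window plus the horizon.
    for start in range(len(series) - window - horizon + 1):
        inputs = series[start:start + window]
        target = series[start + window + horizon - 1]
        pairs.append((inputs, target))
    return pairs


# Placeholder readings for illustration only, NOT real pollutant data.
readings = [1, 2, 3, 4, 5]
pairs = make_windows(readings, window=3)
```

Each pair holds `window` past values and the single value the model would be trained to predict; a series shorter than `window + horizon` yields no pairs.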
Some analysis of the accident dataset:
For more results, please refer to the notebooks and project reports.