Digital Product School Challenge

Digital Product School (DPS) is a 3-month training program to empower the next generation of digital product makers, in which diverse, cross-functional teams solve real-world challenges.

This repository contains what I developed for the AI Track challenge. The challenge is to predict the total number of accidents (under accident_type) in Munich for 2020.

Data

The monthly traffic accident dataset covers three categories: traffic accidents, escape accidents and alcohol accidents.

The dataset used can be both explored online and downloaded via dataset.

As the description above indicates, this data represents traffic accidents. The dataset has other columns, but only the first five were considered: category, accident_type, year, month and value.

Moreover, to make the data easier for non-German speakers to understand, the values in the category and accident_type columns were translated using the following dictionaries:

category_map = {
    "Alkoholunfälle": "alcohol",
    "Fluchtunfälle": "escape",
    "Verkehrsunfälle": "traffic"
}

accident_type = {
    "insgesamt": "total",
    "Verletzte und Getötete": "injured_or_killed",
    "mit Personenschäden": "personal_injury"
}
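As a minimal sketch of how these dictionaries might be applied with pandas: the `translate` helper, the small example frame and the `accident_type_map` name (renamed so it does not clash with the column name) are illustrative assumptions, not the repository's code.

```python
import pandas as pd

category_map = {
    "Alkoholunfälle": "alcohol",
    "Fluchtunfälle": "escape",
    "Verkehrsunfälle": "traffic",
}
# Renamed here (assumption) so the dictionary does not share the column's name.
accident_type_map = {
    "insgesamt": "total",
    "Verletzte und Getötete": "injured_or_killed",
    "mit Personenschäden": "personal_injury",
}


def translate(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with the German labels replaced by English ones."""
    out = df.copy()
    # Series.map yields NaN for labels missing from the dictionary,
    # which makes unexpected categories easy to spot.
    out["category"] = out["category"].map(category_map)
    out["accident_type"] = out["accident_type"].map(accident_type_map)
    return out


# Hypothetical rows, only to show the call.
raw = pd.DataFrame({
    "category": ["Alkoholunfälle", "Verkehrsunfälle"],
    "accident_type": ["insgesamt", "mit Personenschäden"],
})
translated = translate(raw)
```

Working on a copy keeps the original German labels available in `raw` in case they are needed later.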

Finally, these transformations result in the data dictionary below:

Column	type	description
category	text	one of ("alcohol", "escape", "traffic")
accident_type	text	one of ("total", "injured_or_killed", "personal_injury")
year	numeric	data from 2000-2022
month	text	either `Summe` (total) or ends with two digit month (e.g. `03`)
value	numeric	number of registered occurrences
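Because the `month` column mixes yearly totals (`Summe`) with monthly rows, the totals have to be dropped before a monthly time series can be built. The sketch below shows one way to do this, assuming monthly entries look like `202003` (a year followed by a two-digit month); the `to_monthly_series` helper and the numbers are made up for illustration.

```python
import pandas as pd


def to_monthly_series(df: pd.DataFrame) -> pd.Series:
    """Drop yearly totals and index the values by month start date."""
    monthly = df[df["month"] != "Summe"].copy()
    # The last two characters of `month` hold the month number.
    month_number = monthly["month"].astype(str).str[-2:].astype(int)
    date_parts = pd.DataFrame(
        {"year": monthly["year"], "month": month_number, "day": 1}
    )
    monthly["date"] = pd.to_datetime(date_parts)
    return monthly.set_index("date")["value"].sort_index()


# Hypothetical rows, not real figures from the dataset.
example = pd.DataFrame({
    "year": [2019, 2019, 2019],
    "month": ["Summe", "201902", "201901"],
    "value": [300, 140, 160],
})
series = to_monthly_series(example)
```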

Visualization

The time series of the downloaded data can be seen below. The data clearly has a seasonal component, with a peak in the summer months and a trough in the winter months. It also shows a longer-term trend: a decrease, followed by an increase, and then another decrease due to the COVID lockdowns.

Another way of observing such patterns is through a seasonal decomposition, shown below. Both the seasonal and the trend components are clear, and the residuals appear to be stationary, which is a good sign for modelling.

Modelling

For simplicity's sake, only the total accident_type was considered. The data was split into train and test sets, with the last 24 months used as the test set: the models were fitted on the train set and evaluated on the test set.
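A chronological split like this can be sketched as follows. The `split_last_n` helper and the placeholder series (monthly values from 2000-01 to 2021-12) are assumptions for illustration; the important point is that the data is sliced in time order rather than shuffled.

```python
import numpy as np
import pandas as pd


def split_last_n(series: pd.Series, n_test: int = 24):
    """Hold out the final n_test observations without shuffling."""
    if not 0 < n_test < len(series):
        raise ValueError("n_test must be between 1 and len(series) - 1")
    return series.iloc[:-n_test], series.iloc[-n_test:]


# Hypothetical monthly series covering 2000-01 to 2021-12.
index = pd.date_range("2000-01-01", "2021-12-01", freq="MS")
series = pd.Series(np.arange(len(index), dtype=float), index=index)
train, test = split_last_n(series, n_test=24)
```

With this placeholder range the test set starts in January 2020, so every model is evaluated only on observations that come after its training data.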

The models used were:

SARIMAX: a classic time series model that combines seasonal, autoregressive and moving average components with optional exogenous variables; no exogenous variables were used here. The chosen model was statsmodels' SARIMAX(1, 1, 1)(1, 0, 1, 12), which captured both the seasonal and the trend components of the data. This is arguably the most robust model, since its moving average terms would let it readjust to the COVID lockdowns, but it did not perform well because the data from 2020 onwards was used as the test set.
Persistence: this model simply repeats the last observation it has seen. It performs well because the last value is a good predictor of the next one, but it ignores both the seasonal and the trend components. This model was used as a baseline.
Average: another simple model, which computes the mean of the training data and predicts it n_steps ahead. It did not perform well because of the change COVID introduced in the 2020 test data.
Seasonal: this model simply predicts the value of the same month in the previous year. It performs well because of the seasonal component of the data, but it ignores the trend component. Interestingly, it performs almost as well as SARIMAX, owing to the strength of the seasonal component.

The models were evaluated using the root mean squared error (RMSE); the results are shown in the table below.

Model Name	RMSE
SARIMAX	915.08
Persistence	766.22
Average	1045.10
Seasonal	934.66
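For reference, RMSE is the square root of the mean squared difference between observed and predicted values, so it is expressed in the same unit as the data (number of accidents). A minimal sketch with NumPy follows; the `rmse` helper and the example numbers are illustrative and unrelated to the table above.

```python
import numpy as np


def rmse(y_true, y_pred) -> float:
    """Root mean squared error between two equally shaped sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    if y_true.shape != y_pred.shape:
        raise ValueError("y_true and y_pred must have the same shape")
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


# Errors are -10, 10 and 0, so the result is sqrt(200 / 3).
example_error = rmse([100, 200, 300], [110, 190, 300])
```

Because the errors are squared before averaging, a few large misses (such as the drop during lockdowns) weigh more heavily than many small ones.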

The results can be observed in the image below.

Deployment

This project was deployed on an AWS EC2 instance, using docker-compose. The docker-compose file is available in the repository.

TODO: create a GitHub Actions pipeline for deployment.

Step by Step Instructions

Running Locally

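# create and activate a fresh environment
conda create -n dpsenv
conda activate dpsenv
# install the dependencies listed in environment.yml
# (the file can be regenerated with: conda env export > environment.yml)
conda env update --file environment.yml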

Running on AWS

# example extracted from SSH connect EC2 page
ssh -i "dps.pem" ec2-user@ec2-52-201-231-31.compute-1.amazonaws.com
# from https://github.com/moby/moby/issues/17645#issuecomment-153291483
sudo su -
yum -y install docker
# from https://stackoverflow.com/a/65478517/14403987
curl -L https://github.com/docker/compose/releases/latest/download/docker-compose-$(uname -s)-$(uname -m) -o /usr/local/bin/docker-compose
chmod +x /usr/local/bin/docker-compose
service docker start
logout # back to ec2-user
# check the installation first, then start the services
docker-compose version
docker-compose up
