SAFE DRIVING CHALLENGE 
ML Project Report 
BACHELOR OF TECHNOLOGY 
              IN 
COMPUTER SCIENCE & ENGINEERING 
SUBMITTED BY 
NAME OF THE STUDENT 
Ms. T BINDHU BHARGAVI 	
Department of Computer Science and Engineering 
BVRIT HYDERABAD College of Engineering for Women
(Approved by AICTE, New Delhi and Affiliated to JNTUH, Hyderabad) Bachupally, Hyderabad – 500090



 
 
 
Acknowledgement 
 
First, I would like to express my immense gratitude to BVRIT HYDERABAD College of Engineering for Women, which created a great platform for attaining profound technical skills in the field of Computer Science through the industry-enabled learning program WISE.
 
I would like to extend my sincere thanks and gratitude to Dr. K V N Sunitha, Principal, BVRIT HYDERABAD College of Engineering for Women, and to the WISE team of the college for their meticulous planning and conduct of this learning program.
 
I would also like to extend my sincere thanks to WISE and the Talent Sprint team for providing us with this unique learning platform.
 
                                                                                                                
T Bindhu Bhargavi
 
 
INDEX 
 
S.NO 	Contents
1 	Abstract
2 	Introduction
3 	Problem statement
4 	Approach and Statistics of code
5 	Data Sets
6 	First Model
7 	Feature Engineering
8 	PCA
9 	Neural Network
10 	Second Model
11 	Random forest and Naïve Bayes
12 	Comparisons of Models
13 	Result
14 	Reference Link and Project Link
 
LIST OF FIGURES 
 
S.NO 	Name of the figure
1 	Scree plot of 30 features
2 	Histogram of mean alertness per trial
3 	ROC curve of two models
 

 
 
 
ABSTRACT 
 
In this project we introduce a classifier that takes in multidimensional data consisting of real-world measurements of physiological, environmental and vehicular continuous features obtained from a number of driving sessions. We show that a Naive Bayes classifier, which assumes the data to follow a Gaussian distribution, can predict whether the driver is alert or not while driving and achieve a reasonably low misclassification rate for the given data. We inspect how insight into the relevant features was obtained by using Principal Component Analysis (PCA) and a simple correlation matrix. We were able to obtain misclassification rates as low as 12.03 % and 27.07 % for the test and training data respectively.
 	  
INTRODUCTION 
 
With a training set and a test set consisting of 33 features from real-time measurements, we want to use that information to predict whether a certain driver is alert or not alert while driving. Our goal is to construct a binary classifier that predicts a binary target value using the whole or a subset of the 33 features and gives a prediction as

           Prediction = 1     if the driver is alert              (1)
           Prediction = 0     if the driver is not alert

A. Datasets
The datasets are obtained from the website www.kaggle.com and consist of one training set and one test set. The datasets include measurements from a total of 510 real-time driving sessions, where each driving session takes 2 minutes. A new measurement of each of the 33 features is taken every 100ms. The headers in the datasets are listed in table I below. The training set is a set of measurements from 510 driving sessions done by 100 people, which results in a training set of size 604330×33.
 
 
 
 
 
 
 
 
 
 
 
 
 
 
Problem Statement 
Driving while distracted, fatigued or drowsy may lead to accidents. Activities that divert the driver's attention from the road ahead, such as engaging in a conversation with other passengers in the car, making or receiving phone calls, sending or receiving text messages, eating while driving or events outside the car may cause driver distraction. Fatigue and drowsiness can result from driving long hours or from lack of sleep. 
The data for this Kaggle challenge shows the results of a number of "trials", each one representing about 2 minutes of sequential data that are recorded every 100ms during a driving session on the road or in a driving simulator. The trials are samples from some 100 drivers of both genders, and of different ages and ethnic backgrounds. The files are structured as follows: 
The first column is the Trial ID: each period of around 2 minutes of sequential data has a unique trial ID. For instance, the first 1210 observations represent sequential observations every 100ms, and therefore all have the same trial ID.
The second column is the observation number: this is a sequentially increasing number within one trial ID.
The third column has a value X for each row where
 
           X = 1     if the driver is alert 
 
           X = 0     if the driver is not alert 
 
The next 8 columns with headers P1, P2 , …….., P8 represent physiological data; 
The next 11 columns with headers E1, E2, …….., E11 represent environmental data; 
The next 11 columns with headers V1, V2, …….., V11 represent vehicular data; 
 
 
 
 
 
 
APPROACH 
 
•	Initially, we have analysed train and test datasets 
•	Imported the required libraries 
•	Using data pre-processing, logistic regression, feature engineering, PCA, support vector regression and a neural network, we have predicted the output.
 
 
STATISTICS OF THE CODE 
•	We have used Google Colaboratory to run the code and predict the output.
 
 
 
 
 
 





 
 
 
SAFE DRIVING CHALLENGE 
 
1 INTRODUCTION 
The objective is to design a classifier that will detect whether the driver is alert or not alert, employing data that are acquired while driving. This report is meant to illustrate the process of building a predictive machine learning model.
 
2 DATASETS 
There are 604,329 instances of data in the training dataset and 120,840 instances of data in the test dataset. The data for this challenge shows the results of a number of "trials", each one representing about 2 minutes of sequential data that are recorded every 100ms during a driving session on the road or in a driving simulator. The trials are samples from some 100 drivers of both genders, and of different ages and ethnic backgrounds. The training data was broken into 500 trials, and each trial consisted of a sequence of approximately 1200 measurements spaced by 0.1 seconds. Each measurement consisted of 30 features, presented in three sets: physiological (P1...P8), environmental (E1...E11) and vehicular (V1...V11). Each feature was presented as a real number. For each measurement we were also told whether the driver was alert or not at that time (a Boolean label called Is Alert). No more information on the features was available.
3 EXISTING MODELS
This section summarizes existing work in order to formulate a plan for building a better-performing machine learning predictive model. Similar machine learning techniques have been applied to this dataset: the techniques most participants used are limited to Naive Bayes, Logistic Regression, Support Vector Machine, Neural Network, and Random Forest. However, the performances of their models differ widely, as they pre-processed the original data in different ways, especially in their feature engineering.
Thus, in this summary I will mainly focus on the feature engineering methods applied by the participants, instead of how they chose the parameters of their algorithms.
The highest score (AUC = 0.861151) was reached by a logistic regression model. As the dataset consists of sequential data recorded every 100ms for 2 minutes in each trial, the data was partitioned by trials (Trial ID) rather than randomly. The means and standard deviations of each trial were computed as new features (including the target feature Is Alert). Afterwards, feature selection based on diagnostics of the logistic regression was conducted and three strong features were chosen for modelling (sdE5, V11, and E9). However, this model uses future observations (the mean and standard deviation can only be calculated when a trial is finished) and is thus inapplicable to real-life situations. When a running mean and standard deviation were used for training instead, the AUC dropped slightly, from 0.861151 to 0.849245.
Another participant focused on the instances at the initial moment the driver lost alertness. The dataset is reduced significantly in this way, and he highlighted the factors that change significantly at the status change for feature selection. E4, E5, E6, E7, E8, E9, E10, P6, V4, V6, V10, and V11 were selected for building a Neural Network. This model reaches an AUC of 0.84953.
Another entry also aggregates the data from each trial and calculates means and standard deviations as additional features. After tossing out correlated features and other feature engineering, a logistic regression model trained on the feature-selected data reaches an AUC of 0.80779.
Fourier generates around 600 new features for the dataset (the inverse, the square, and the cube of each feature, all the combinations of 2 columns, and time interval variables) and applies forward search to select the predictive features that give the highest AUC. A Naive Bayes model trained on these selected features reaches an AUC of 0.844.
Another entry trained an epsilon-SVR model with an RBF kernel and parameters c = 2, g = 1/30, and p = 0.1, which reaches an AUC of 0.839. Yet another applies a random forest with 199 trees and a minimum node size of 25, with the correlated features tossed out beforehand; this predictive model reaches an AUC of 0.81410.
3.1 SUMMARY OF EXISTING MODELS
An important property of this dataset is that it contains sequential data: for each trial, the dataset records data every 100ms. Thus, all the participants partition the dataset by trials in order to preserve this sequential structure. Aggregating data within a trial to generate means and standard deviations as new features for modelling has proven to be a useful method of data pre-processing. Another useful method of data pre-processing is to choose the instances close to the moment the driver lost alertness, which reduces the time needed to train the models significantly.
Multiple methods of feature engineering and selection are applied: the mean and standard deviation of existing features, the inverse, the square, the cube, and combinations of 2 columns are viewed as potentially useful new features. Correlated features and features that remain constant are always tossed out. As for the choice of predictive machine learning algorithm, there is no valid proof that one algorithm outperforms all the others in this specific situation. Generally, Naive Bayes, Logistic Regression, Random Forest, Support Vector Machine, and Neural Network all reach a good performance in this case.
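The trial-wise partitioning described in this summary can be sketched as follows. This is a minimal illustration and not the participants' actual code: the function name `split_by_trial`, the 20% validation fraction and the toy trial IDs are all assumptions made for the example.

```python
import random

def split_by_trial(trial_ids, validation_fraction=0.2, seed=0):
    """Return (train_rows, validation_rows) as lists of row indices.

    Whole trials are assigned to one side, so the sequential observations
    of a trial never straddle the split.
    """
    unique_trials = sorted(set(trial_ids))
    rng = random.Random(seed)
    rng.shuffle(unique_trials)
    n_validation = max(1, round(len(unique_trials) * validation_fraction))
    validation_trials = set(unique_trials[:n_validation])
    train_rows = [i for i, t in enumerate(trial_ids) if t not in validation_trials]
    validation_rows = [i for i, t in enumerate(trial_ids) if t in validation_trials]
    return train_rows, validation_rows

# Toy data (assumption): 10 trials with 3 observations each; real trials have about 1200.
trial_ids = [trial for trial in range(10) for _ in range(3)]
train_rows, validation_rows = split_by_trial(trial_ids)
```

Splitting on trial IDs rather than on individual rows keeps neighbouring 100ms observations, which are nearly identical, from appearing on both sides of the split.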
4 MODEL BUILDING PLANS 
Even though many existing models already perform decently, it is still possible to improve on them. A plan for building a new predictive model is outlined in this section.
4.1 GAP IDENTIFICATION
The predictive model with the highest AUC value is trained on 20% of the training dataset. What is more, the means and standard deviations of each trial are future-observation features. These points make this predictive model inapplicable to a real-life situation. An AUC value of 0.861151 also means there is still room for improvement. Another noticeable point is that most of the existing models are evaluated by either AUC score or classification accuracy. For this specific situation, it is more important to identify the 'not alert' instances, as driving while not alert can be deadly: failing to identify 'not alert' can lead to worse consequences than failing to identify 'alert'. Thus, the true negative rate (TN / (TN + FP)), also called specificity, is a valuable evaluation measure, as it shows the percentage of 'not alert' instances successfully identified. Furthermore, all the models' classification accuracies are above 50%, which makes it possible to build an ensemble model that reaches a better performance, provided the recall and the specificity of each member model are both above 50%.
4.2 MODEL BUILDING PLAN
First, the existing models with good performance will be reproduced, including the way they pre-process the data and the parameters they choose to build predictive models. Second, a local evaluation will be conducted on these models. Recall and specificity will be used for evaluation, in addition to classification accuracy and AUC score. Then the models whose recall and specificity are both higher than 50% will be grouped to build an ensemble model, which aims to reach a better performance than all the existing models.
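The selection rule in this plan (keep models whose recall and specificity both exceed 50%, then combine them) could look roughly like the sketch below. The report does not say how the ensemble combines its members, so the majority vote and its tie-breaking rule are assumptions, as are the function names and the toy predictions. The recall and specificity values are taken from the results table in section 5.1.2.

```python
def eligible_models(metrics, threshold=0.5):
    """Keep models whose recall and specificity both exceed the threshold."""
    return [name for name, (recall, specificity) in metrics.items()
            if recall > threshold and specificity > threshold]

def majority_vote(predictions):
    """Combine 0/1 predictions from several models, row by row.

    Ties are resolved as 0 ('not alert'), an assumption made here because
    missing a 'not alert' driver is the costlier error.
    """
    combined = []
    for row_votes in zip(*predictions):
        combined.append(1 if sum(row_votes) * 2 > len(row_votes) else 0)
    return combined

# (recall, specificity) pairs as reported in section 5.1.2.
metrics = {
    "Logistic Regression": (0.9432, 0.2525),
    "Random Forest": (0.97, 0.9008),
    "Nu-SVC": (0.9236, 0.6045),
}
selected = eligible_models(metrics)

# Hypothetical predictions of three models for four observations.
votes = majority_vote([[1, 0, 1, 0], [1, 1, 0, 0], [0, 1, 1, 0]])
```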
5 SOLUTION DEVELOPMENT 
Python is used as the development environment for this project, and Scikit-learn is the machine learning library applied. Missing data is recorded as 0 throughout the dataset.
5.1 FIRST MODEL
The first predictive model is built with the following data pre-processing method. We were concerned that using the entire dataset would create too much noise and lead to inaccuracies in the model. The final goal of the system is to detect the change in the driver from alert to not alert so that the car can self-correct or alert the driver. So, we decided to focus only on the data around the moment when the driver lost alertness. Accordingly, I subset the dataset to these moments: the first rows with 'Is Alert' == 0 and the last rows with 'Is Alert' == 0 are chosen, along with 5 rows before and after each (with 100ms between observations, 5 rows correspond to the data recorded 0.5s before and after). After subsetting, 37421 instances without duplication are chosen to build the predictive model.
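One possible reading of this subsetting step is sketched below for the labels of a single trial: for every run of consecutive 'not alert' rows, take the first and last row of the run plus 5 rows on either side, without duplicates. The function name `transition_window_rows` and the toy labels are assumptions, and the project's own code may select rows differently.

```python
def transition_window_rows(is_alert, margin=5):
    """Return sorted row indices near the edges of each 'not alert' run.

    For every run of consecutive 0 labels, the first and last row of the
    run are taken together with `margin` rows before and after each.
    """
    n = len(is_alert)
    selected = set()  # a set removes rows picked by overlapping windows
    for i, label in enumerate(is_alert):
        if label != 0:
            continue
        starts_run = i == 0 or is_alert[i - 1] != 0
        ends_run = i == n - 1 or is_alert[i + 1] != 0
        if starts_run or ends_run:
            low = max(0, i - margin)
            high = min(n - 1, i + margin)
            selected.update(range(low, high + 1))
    return sorted(selected)

# Toy trial (assumption): alert for 10 rows, not alert for 20, alert again for 10.
labels = [1] * 10 + [0] * 20 + [1] * 10
rows = transition_window_rows(labels)
```

The function should be applied to each trial separately so that a window never reaches into the neighbouring trial.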
 
  
 
5.1.1 FEATURE ENGINEERING 
 
There are 30 features included in the dataset, so keeping only the features with higher impact could not only save computational resources but also potentially improve the performance of the predictive model. Principal Component Analysis (PCA) is applied as the feature engineering technique in this case. For PCA, the dataset is first standardized, then the fraction of variance explained by each component is calculated to identify those with a higher impact on the result.
 
 
 
Figure 1: Scree plot of 30 features
We can see that the first 14 components contribute 80.95% of the total variance, so the number of features selected for modelling is decreased from 30 to 14 in this way.
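The way a component count is read off a scree plot can be written as a small helper. The sketch assumes the explained-variance ratios are already available (in scikit-learn they are exposed as `PCA.explained_variance_ratio_` after fitting on standardized data). The 80% target and the five ratios below are made-up illustrative values, not the project's numbers.

```python
def components_for_variance(explained_variance_ratio, target=0.80):
    """Return the smallest number of leading components reaching `target`,
    together with the cumulative variance they cover."""
    cumulative = 0.0
    for count, ratio in enumerate(explained_variance_ratio, start=1):
        cumulative += ratio
        if cumulative >= target:
            return count, cumulative
    return len(explained_variance_ratio), cumulative

# Hypothetical ratios for a 5-feature example, sorted from largest to smallest.
ratios = [0.40, 0.25, 0.20, 0.10, 0.05]
count, covered = components_for_variance(ratios)
```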
 
5.1.2 MODELLING 
As the subset is relatively small, stratified 10-fold cross-validation is applied as the evaluation method to make full use of the dataset. Naive Bayes, Logistic Regression, Random Forest, Support Vector Machine, and Neural Network models are built from this dataset. The Gaussian Naive Bayes model reaches a validation accuracy of 61.74%. Logistic Regression with the 'liblinear' optimization algorithm reaches a validation accuracy of 64.6%. Multiple models in the Support Vector Machine family, including Linear-SVC, Nu-SVC and C-SVC, are applied as well. Their validation accuracies vary from 65% to 78%.
A Neural Network with 5 neurons in the first hidden layer and 2 neurons in the second hidden layer reaches a validation accuracy of 65.01%.
I also tried neural networks with different architectures: another neural network has 5 hidden layers with 14, 14, 12, 10 and 5 neurons respectively, and its activation function is changed from ReLU to the logistic (sigmoid) function. Unfortunately, the performance of the new neural network does not change much.
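To make the idea of stratified 10-fold cross-validation concrete, the sketch below assigns rows to folds so that each fold keeps the overall ratio of alert to not alert rows. It is a plain-Python illustration with assumed names and toy labels; in practice scikit-learn's `StratifiedKFold` would normally be used instead.

```python
import random

def stratified_folds(labels, n_folds=10, seed=0):
    """Assign each row to a fold so every fold keeps the class proportions."""
    rng = random.Random(seed)
    fold_of_row = [None] * len(labels)
    for cls in sorted(set(labels)):
        rows = [i for i, label in enumerate(labels) if label == cls]
        rng.shuffle(rows)
        # Deal the rows of this class round-robin across the folds.
        for position, row in enumerate(rows):
            fold_of_row[row] = position % n_folds
    return fold_of_row

# Toy labels (assumption): 60 alert (1) and 40 not alert (0) observations.
labels = [1] * 60 + [0] * 40
folds = stratified_folds(labels)
```

Each fold is then held out once for validation while the model is trained on the other nine.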
  
 
 
  
 
 
Model 	Accuracy 	Recall 	Specificity 	AUC Score
Logistic Regression 	64.60% 	94.32% 	25.25% 	0.5978
Naive Bayes 	61.74% 	89.47% 	25.02% 	0.5724
Random Forest 	93.54% 	97% 	90.08% 	0.9354
Linear-SVC 	64.55% 	94.85% 	24.44% 	0.5958
Nu-SVC 	78.63% 	92.36% 	60.45% 	0.7640
C-SVC 	67.55% 	98.13% 	27.06% 	0.626
Neural Network1 	65.01% 	94.31% 	26.21% 	0.6026
Neural Network2 	65.96% 	97.02% 	24.83% 	0.6092
Performance of the algorithms on the PCA dataset
 
It can be seen from the performance table that all the models perform well at predicting 'alert' drivers. However, most models cannot reach a decent result when it comes to identifying drivers who are not alert, which is more important in this specific situation. On the other hand, the Random Forest model outperforms all other models, especially in specificity, which makes it a part of our final ensemble model. The Nu-SVC model reaches a specificity of more than 50% as well, which means it can also be part of an ensemble model.
 
5.2 SECOND MODEL 
Unfortunately, we did not get a good predictive machine learning model with the first data pre-processing method (apart from the Random Forest model with 50 trees). I decided to conduct an exploratory analysis on the dataset in order to guide the data pre-processing.
 
5.2.1 EXPLORATORY ANALYSIS AND FEATURE ENGINEERING ON THE DATASET 
We calculate the average Is Alert value per trial and plot the result on a histogram. 
Figure 2: Histogram of mean alertness per trial
 
  

 
It is found that most drivers either stay alert or stay not alert throughout the whole trial of about 1200 observations. Thus, the characteristics of each driver, recorded in the mean and standard deviation of each attribute, can be helpful for predictive analysis.
On the other hand, the mean and standard deviation of a trial are not available at the beginning of the trial, which makes using whole-trial means and standard deviations of each feature impractical in a real-life situation. Moreover, whole-trial means and standard deviations cannot capture changes in the driver's behaviour within a trial, which may be changing constantly over time.
For these reasons, we decided to use rolling means and standard deviations of each feature as new features, instead of whole-trial means and standard deviations, in order to make full use of the sequential nature of the data.
The rolling window is set to 5: for every 5 consecutive instances (500ms), the mean and standard deviation are calculated, then the window drops its first instance and adds the next new instance, and so on.
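The rolling window described above can be sketched in plain Python as follows. The function name and the toy signal are assumptions for the example. The sample standard deviation (`statistics.stdev`) is used here, and the report does not state which variant the project used.

```python
from collections import deque
from statistics import mean, stdev

def rolling_mean_std(values, window=5):
    """Return one (mean, std) pair per row, using only the current and past rows.

    Rows that do not yet have a full window get (None, None).
    """
    recent = deque(maxlen=window)  # the oldest value drops out automatically
    features = []
    for value in values:
        recent.append(value)
        if len(recent) < window:
            features.append((None, None))
        else:
            features.append((mean(recent), stdev(recent)))
    return features

# Toy signal (assumption) for one feature within one trial, sampled every 100 ms.
signal = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
features = rolling_mean_std(signal)
```

Because each pair depends only on the current and previous rows, these features are available while the trial is still running. The function should be called once per feature and per trial so that a window never mixes rows from two trials.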
  
 
 
5.2.2 MODELLING 
Similarly, we applied the algorithms mentioned above to this pre-processed dataset. As the dataset is big enough, we use an 80%-20% train-test split instead of cross-validation. First, we tried the Random Forest algorithm, which performed best on the previous feature-selected dataset, to see whether there is any improvement compared to the other pre-processing method. The Random Forest has 50 trees, and its parameters are the same as before. It reaches a decent performance on the validation dataset, with a validation accuracy of 98.91%. The algorithms of the Support Vector Machine family all fail to converge within the allotted period of time. A neural network with four hidden layers of 90, 70, 50 and 30 neurons respectively is also applied and reaches a validation accuracy of 80.76%. Furthermore, Naive Bayes and Logistic Regression have not improved much compared to the previous models. Generally, Neural Network and Random Forest perform better than the other models in this situation, and Random Forest performs far better than the Neural Network.
 
 
 
 
Model 	Accuracy 	Recall 	Specificity 	AUC Score
Logistic Regression 	61.21% 	75.21% 	41.96% 	0.5858
Naive Bayes 	62.86% 	45.28% 	87.02% 	0.6615
Random Forest 	98.91% 	98.58% 	97.55% 	0.9873
Neural Network 	80.76% 	96.40% 	59.26% 	0.7783
Performance of the algorithms on the rolling mean and standard deviation dataset
 
 
5.2.3 COMPARISON OF MODELS
 
Comparing the performances of models trained on data pre-processed by different methods, it is found that logistic regression, Support Vector Machine, and Naive Bayes are not suitable for this problem. While the Neural Network can reach a good performance on the dataset pre-processed by generating time-sequential features, it is not the model that fits the dataset best. The Random Forest algorithm generates the best result, whether trained on data pre-processed by PCA or on data pre-processed by the other feature engineering technique. Another interesting finding is that most models perform better at predicting 'alert' drivers than at predicting 'not alert' drivers, apart from the Naive Bayes model. Considering that the two labels are fairly evenly distributed (alert: 349785, not alert: 254544, or about 58% and 42%), it is hard to say that one label is over-represented, which makes the unbalanced prediction results hard to explain.
As a result, I choose three models for local evaluation: two Random Forest models and a Neural Network model.
6 Local Evaluation 
We use the data file 'solution.csv' to evaluate the final models.
6.1 MODEL 1 
The first model is the Random Forest trained by the data with features selected from PCA. 
 
 	Predict = 0 	Predict=1 
Actual = 0 	22571 	7343 
Actual = 1 	63616 	27310 
AUC = 0.52744 
Though this model only reaches an accuracy of 41.28% on the test dataset, it identifies many 'not alert' drivers correctly. Overall, this model is not good enough, whichever measure it is evaluated by.
6.2 MODEL 2 
The second model is the Random Forest trained from the data with added features of rolling mean and standard deviation. 
 
 
 	Predict = 0 	Predict = 1 
Actual = 0 	16671 	13243 
Actual = 1 	8679 	82247 
AUC = 0.7309 
This model reaches a good performance, with a classification accuracy of 81.86% on the test data. It performs well at predicting alert drivers, with recall = 90.45%, precision = 86.13% and F1-score = 88.24%. However, in this specific situation the model is expected to predict 'not alert' drivers precisely, so specificity (the percentage of Actual = 0 instances predicted correctly) should be the evaluation measure we focus on. The specificity of this model only reaches 55.73%, which still leaves a lot of room for improvement.
Overall, the AUC value of this model is 0.7309, not as good as the work I referenced, but still an improvement.
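The figures quoted for Model 2 follow directly from its confusion matrix. The short sketch below recomputes them, treating 'alert' (1) as the positive class. The function name `binary_metrics` is an assumption for the example, and the four counts are the ones in the table above.

```python
def binary_metrics(tn, fp, fn, tp):
    """Compute metrics with 'alert' (1) as the positive class."""
    total = tn + fp + fn + tp
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    return {
        "accuracy": (tp + tn) / total,
        "recall": recall,
        "precision": precision,
        "specificity": tn / (tn + fp),
        "f1": 2 * precision * recall / (precision + recall),
    }

# Confusion matrix of Model 2 (rows: actual, columns: predicted).
model_2 = binary_metrics(tn=16671, fp=13243, fn=8679, tp=82247)
percentages = {name: round(100 * value, 2) for name, value in model_2.items()}
```

The same function can be applied to the confusion matrices of Model 1 and Model 3 to compare specificity, the measure this report cares about most.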
6.3 MODEL 3 
The third model is the Neural Network trained on the data with added features of rolling mean and standard deviation. It has four hidden layers with 90, 70, 50 and 30 neurons respectively. The intention was for the first layer to take in all the original features and the following layers to process them and predict the output, thus the first layer has as many neurons as there are features.
 
 	Predict = 0 	Predict = 1 
Actual = 0 	12886 	17028 
Actual = 1 	361 	90565 
AUC = 0.71340 
This model reaches a good performance as well, compared to the first model. It successfully predicts most of the alert drivers (recall = 99.6%, precision = 84.17%, F1-score = 91.24%). However, the model fails to predict many 'not alert' drivers correctly (specificity = 43.08%), which is the more important evaluation measure for this predictive model.
6.4 COMPARISON OF MODEL 2 AND MODEL 3 
Both models perform better at predicting alert drivers than at identifying not alert drivers, as their true positive rates are both higher than their true negative rates in their confusion matrices, though the main goal of this predictive model is to predict not alert drivers. Of the two ROC curves, the one that reaches a 100% true positive rate first is the neural network, and the other is the random forest. It can also be seen that the random forest model performs better overall than the neural network model. However, the neural network predicts most alert drivers correctly, and when it predicts a driver as not alert, it is correct most of the time.
 
 
 
 
The random forest model identifies a significantly higher share of the not alert drivers than the neural network model, though when it predicts a driver as not alert, it has a 34.25% chance of being wrong. The Random Forest model would be the better choice in this situation, but the architecture of the neural network model could be optimized to reach a higher performance.
 
7 RESULT REFLECTION AND COMPARISON 
 
7.1 RESULT CONCLUSION 
This project was meant to build a supervised learning model to predict not alert drivers. The best performance is achieved by a Random Forest with 50 trees. It predicts 16671 of the 29914 not alert instances in the test data correctly, and reaches a classification accuracy of 81.86% and an AUC value of 0.7309.
 
7.2 RESULT COMPARISON 
Compared to the results on the leaderboard, many participants' models reach a higher performance. The best model reaches an AUC value of 0.86115, though it uses the means and standard deviations of each trial as new features. Almost 20 participants' models reach AUC scores over 0.8, which is significantly higher than mine. This indicates that there is still a lot of room to improve my model.
 
7.3 DISCUSSION AND FUTURE WORK 
The existing predictive model for this problem is far from perfect. There are a few directions that could improve the performance of the model.
 
7.3.1 DATA PREPROCESSING METHOD 
 
The rolling mean and standard deviation proved to be a good method of preserving the sequential nature of the data. However, a rolling window of 0.5s could be too short to capture the driving pattern of a driver. Expanding the rolling window to produce rolling means and standard deviations over a longer period could be a useful way to introduce the long-term driving patterns of drivers. Better performance could likely be achieved by a model that learns not only from drivers' behaviour over a short time (0.5s) but over a long time as well.
 
 
 
7.3.2 MODEL OPTIMIZATION AND SELECTION 
 
Though the neural network model fails to produce a better performance than the random forest model, it is not yet convincing that random forest is always the best option for this problem. The neural network still shows great potential to produce good results. Further work could optimize the architecture of the neural network.
 
REFERENCE LINK 
 
• https://www.kaggle.com/c/stayalert/data 
 
PROJECT LINK 
 
https://github.com/bindhu520/Safe-driving-Challenge-ML-PROECT-