SAFE DRIVING CHALLENGE

ML Project Report

BACHELOR OF TECHNOLOGY IN COMPUTER SCIENCE & ENGINEERING

SUBMITTED BY
NAME OF THE STUDENT: Ms. T BINDHU BHARGAVI

Department of Computer Science and Engineering
BVRIT HYDERABAD College of Engineering for Women
(Approved by AICTE, New Delhi and Affiliated to JNTUH, Hyderabad)
Bachupally, Hyderabad – 500090

Acknowledgement

Firstly, I would like to express my immense gratitude towards BVRIT HYDERABAD College of Engineering for Women, which created a great platform to attain profound technical skills in the field of Computer Science through this industry-enabled learning program, WISE.

I would like to extend my sincere thanks and gratitude to Dr. K V N Sunitha, Principal, BVRIT HYDERABAD College of Engineering for Women, and to the WISE team of the college for their meticulous planning and conduct of this learning program.

I would also like to extend my sincere thanks to WISE and the team of Talent sprint for enabling us with this unique learning platform.

T Bindhu Bhargavi

INDEX

S.NO  Contents
1   Abstract
2   Introduction
3   Problem statement
4   Approach and Statistics of code
5   Data Sets
6   First Model
7   Feature Engineering
8   PCA
9   Neural Network
10  Second Model
11  Random forest and Naïve Bayes
12  Comparisons of Models
13  Result
14  Reference Link and Project Link

LIST OF FIGURES

S.NO  Name of the figure
1   Scree plot of 30 features
2   Histogram of mean alertness per trial
3   ROC curve of two models

ABSTRACT

In this project we introduce a classifier that takes in multidimensional data consisting of real-world measurements of physiological, environmental and vehicular continuous features obtained from a number of driving sessions.
We will show that, using a Naive Bayes classifier that assumes the data to follow a Gaussian distribution, we can predict whether the driver is alert or not while driving and achieve a reasonably low misclassification rate for the given data. We will inspect how insight into the relevant features was obtained by using Principal Component Analysis (PCA) and a simple correlation matrix. We were able to obtain misclassification rates as low as 12.03% and 27.07% for the test and training data respectively.

INTRODUCTION

With a training set and a test set consisting of 33 features from real-time measurements, we want to use that information to predict whether a certain driver is alert or not alert while driving. Our goal is to construct a binary classifier that predicts a binary target value using all or a subset of the 33 features and gives a prediction as

$$
\text{Prediction} =
\begin{cases}
1 & \text{if the driver is alert} \\
0 & \text{if the driver is not alert}
\end{cases}
\tag{1}
$$

A. Datasets

The datasets are obtained from the website www.kaggle.com and consist of one training set and one test set. The datasets include measurements from a total of 510 real-time driving sessions, where each driving session takes 2 minutes. A new measurement of each of the 33 features is taken every 100ms. The headers in the datasets are listed in table I below. The training set contains the measurements of 510 driving sessions done by 100 people, which results in a training set of size 604330×33.

Problem Statement

Driving while distracted, fatigued or drowsy may lead to accidents. Activities that divert the driver's attention from the road ahead, such as engaging in a conversation with other passengers in the car, making or receiving phone calls, sending or receiving text messages, or eating while driving, as well as events outside the car, may cause driver distraction. Fatigue and drowsiness can result from driving long hours or from lack of sleep.
The data for this Kaggle challenge shows the results of a number of "trials", each one representing about 2 minutes of sequential data that are recorded every 100ms during a driving session on the road or in a driving simulator. The trials are samples from some 100 drivers of both genders, and of different ages and ethnic backgrounds. The files are structured as follows:

• The first column is the Trial ID: each period of around 2 minutes of sequential data has a unique trial ID. For instance, the first 1210 observations represent sequential observations every 100ms, and therefore all have the same trial ID.
• The second column is the observation number, a sequentially increasing number within one trial ID.
• The third column has a value X for each row, where X = 1 if the driver is alert and X = 0 if the driver is not alert.
• The next 8 columns, with headers P1, P2, ..., P8, represent physiological data.
• The next 11 columns, with headers E1, E2, ..., E11, represent environmental data.
• The next 11 columns, with headers V1, V2, ..., V11, represent vehicular data.

APPROACH

• Initially, we analysed the train and test datasets.
• We imported the required libraries.
• We predicted the output using data pre-processing, logistic regression, feature engineering, PCA, support vector regression and a neural network.

STATISTICS OF THE CODE

• We used Google Colaboratory to predict the output.

SAFE DRIVING CHALLENGE

1 INTRODUCTION

The objective is to design a classifier that detects whether the driver is alert or not alert, employing data acquired while driving. This report illustrates the process of building a predictive machine learning model.

2 DATASETS

There are 604,329 instances of data in the training dataset and 120,840 instances of data in the test dataset.
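The column layout described above can be written down as a small check. This is only a sketch: the header names TrialID, ObsNum and IsAlert are assumptions about how the three leading columns are spelled in the CSV files, and the tiny frame below is synthetic, standing in for the real data.

```python
import pandas as pd

def expected_columns():
    """Column order described in the text: two IDs, the label, then P, E and V groups."""
    ids = ["TrialID", "ObsNum", "IsAlert"]            # assumed header spellings
    physiological = [f"P{i}" for i in range(1, 9)]    # P1..P8
    environmental = [f"E{i}" for i in range(1, 12)]   # E1..E11
    vehicular = [f"V{i}" for i in range(1, 12)]       # V1..V11
    return ids + physiological + environmental + vehicular

def feature_columns():
    """The 30 measured features, without the IDs and the label."""
    return expected_columns()[3:]

def check_layout(df):
    """True if the frame has exactly the expected columns and a 0/1 label."""
    same_columns = list(df.columns) == expected_columns()
    return same_columns and set(df["IsAlert"].unique()) <= {0, 1}

# Tiny synthetic frame standing in for the real training file.
cols = expected_columns()
demo = pd.DataFrame([[0, i, i % 2] + [0.0] * 30 for i in range(4)], columns=cols)
print(len(cols), len(feature_columns()), check_layout(demo))
```

Counting the three leading columns together with the 30 measurements gives the 33 columns mentioned for the datasets.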
The training data was broken into 500 trials; each trial consisted of a sequence of approximately 1200 measurements spaced 0.1 seconds apart. Each measurement consisted of 30 features, presented in three sets: physiological (P1...P8), environmental (E1...E11) and vehicular (V1...V11). Each feature was presented as a real number. For each measurement we were also told whether the driver was alert or not at that time (a Boolean label called Is Alert). No more information on the features was available.

3 EXISTING MODELS

This section summarizes existing work in order to formulate a plan for building a better-performing predictive model. Similar machine learning techniques have been applied to this dataset by most participants: the techniques used are limited to Naive Bayes, Logistic Regression, Support Vector Machine, Neural Network, and Random Forest. However, the performances of their models differ widely, because they pre-processed the original data in different ways, especially in their feature engineering. Thus, this summary mainly focuses on the feature engineering methods applied by the participants, rather than on how they chose the parameters of their algorithms.

The highest score (AUC = 0.861151) was reached by a logistic regression model. As the dataset consists of sequential data recorded every 100ms for 2 minutes in each trial, the data was partitioned by trial (Trial ID) rather than randomly. The means and standard deviations of each trial were computed as new features (including for the target feature Is Alert).
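The trial-wise partition and the per-trial aggregates described above can be sketched as follows. This is a minimal illustration on synthetic data, not the participant's code: the column names TrialID, E5 and V11, the 80/20 proportion and the use of scikit-learn's GroupShuffleSplit are all assumptions made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

rng = np.random.default_rng(0)
# Synthetic stand-in: 10 trials of 20 rows each, with two of the 30 features.
df = pd.DataFrame({
    "TrialID": np.repeat(np.arange(10), 20),
    "E5": rng.normal(size=200),
    "V11": rng.normal(size=200),
})

# Hold out whole trials, so that no trial contributes rows to both sides.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, test_idx = next(splitter.split(df, groups=df["TrialID"]))
train, test = df.iloc[train_idx], df.iloc[test_idx]

# Per-trial mean and standard deviation, broadcast back to every row of the trial.
by_trial = df.groupby("TrialID")
df["meanE5"] = by_trial["E5"].transform("mean")
df["sdE5"] = by_trial["E5"].transform("std")

print(sorted(test["TrialID"].unique()), df[["meanE5", "sdE5"]].head(1))
```

Because these aggregates use every row of a trial, they are only known once the trial has finished.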
Afterwards, feature selection based on diagnostics of the logistic regression was conducted and three strong features were chosen for modelling (sdE5, V11, and E9). However, this model uses future observations (the mean and standard deviation can only be calculated when a trial is finished), and is thus inapplicable to real-life situations. When a running mean and standard deviation were used for training instead, the AUC dropped slightly, from 0.861151 to 0.849245.

Focusing on the instances at the initial moment the driver lost alertness reduces the dataset significantly and highlights the factors that change significantly when the status changes, which guides feature selection. E4, E5, E6, E7, E8, E9, E10, P6, V4, V6, V10, and V11 are selected for building a Neural Network. This model reaches an AUC of 0.84953.

Another attempt also aggregates data from each trial and calculates means and standard deviations as additional features. After tossing out correlated features and other feature engineering, a logistic regression model trained on the selected features reaches an AUC of 0.80779.

Fourier generates around 600 new features for the dataset (the inverse, the square, and the cube of each feature, all the combinations of 2 columns, and time interval variables). It reaches the highest AUC by applying forward search to select predictive features. A Naive Bayes model trained on these selected features reaches an AUC of 0.844.

An epsilon-SVR model with an RBF kernel and parameters c = 2, g = 1/30, and p = 0.1 reaches an AUC of 0.839. A random forest with 199 trees and a minimum node size of 25, with the correlated features tossed out beforehand, reaches an AUC of 0.81410.

3.1 SUMMARY OF EXISTING MODELS

An important feature of this dataset is that it contains sequential data. For each trial, the dataset records data every 100ms.
Thus, all the participants shuffle the dataset by trial in order to preserve this sequential structure. Aggregating data within a trial to generate means and standard deviations as new features for modelling has proven to be a useful method of data pre-processing. Another useful method of data pre-processing is to choose the instances close to the moment the driver lost alertness, which reduces the time needed to train the models significantly. Multiple methods of feature engineering are applied: the mean and standard deviation of existing features, the inverse, the square, the cube, and combinations of 2 columns are viewed as potentially useful new features. Correlated features and features that remain constant are always tossed out.

As for the choice of predictive machine learning algorithm, there is no valid proof that one algorithm outperforms all the others in this specific situation. Generally, Naive Bayes, Logistic Regression, Random Forest, Support Vector Machine, and Neural Network all reach a good performance in this case.

4 MODEL BUILDING PLANS

Even though many existing models already have a decent performance, it is still possible to improve on them. A plan for building a new predictive model is outlined in this section.

4.1 GAP IDENTIFICATION

The predictive model with the highest AUC value is trained on 20% of the training dataset. What is more, the per-trial means and standard deviations are features based on future observations. Both points make this predictive model inapplicable to a real-life situation. An AUC value of 0.861151 also means there is still room for improvement.

Another noticeable point is that most of the existing models are evaluated by either AUC score or classification accuracy. For this specific situation, it is more important to identify the not alert instances, as driving while not alert can be deadly. Failing to identify 'not alert' can lead to worse consequences than failing to identify 'alert'.
Thus, the true negative rate (TN / (TN + FP)) can also be a valuable evaluation measure, as it shows the percentage of 'not alert' instances successfully identified. Furthermore, all the models' classification accuracies are above 50%, which makes it possible to build an ensemble model with a better performance, provided that the recall and the specificity of every model in it are above 50% at the same time.

4.2 MODEL BUILDING PLAN

Firstly, the existing models with good performance will be reproduced, including the way they pre-process the data and the parameters they choose to build predictive models. Secondly, a local evaluation will be conducted on these models. Recall and specificity will be used for evaluation, in addition to classification accuracy and AUC score. Then the models with recall and specificity both higher than 50% will be grouped into an ensemble model, which aims to reach a better performance than all the existing models.

5 SOLUTION DEVELOPMENT

Python is used as the development environment for this project, and scikit-learn is the machine learning library applied. Missing data appears as 0 throughout the dataset.

5.1 FIRST MODEL

The first predictive model is built using the first data pre-processing method, described below. We were concerned that using the entire data set would create too much noise and lead to inaccuracies in the model. The final goal of the system is to detect the change in the driver from alert to not alert so that the car can self-correct or alert the driver. So, we decided to focus only on the data at the initial moment when the driver lost alertness.

Accordingly, I subset the dataset to the moments when the driver lost alertness. The rows with 'Is Alert' == 0 and the last rows with 'Is Alert' == 0 are chosen, along with the 5 rows before and after each (with 100ms between observations, this means focusing on the data recorded 0.5s before and after the driver lost alertness).
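One possible reading of this subsetting step is sketched below; it is an interpretation, not the exact code used for the project. It keeps every row within 5 observations of the first or the last row of each run of not alert rows, separately for each trial. The column names TrialID, ObsNum and IsAlert and the helper name are assumptions.

```python
import numpy as np
import pandas as pd

def rows_near_alertness_change(df, k=5):
    """Keep rows within k observations of the start or end of a not-alert run.

    Assumes columns TrialID and IsAlert, with rows ordered by observation
    number inside each trial.
    """
    df = df.reset_index(drop=True)
    keep = np.zeros(len(df), dtype=bool)
    for _, trial in df.groupby("TrialID"):
        pos = trial.index.to_numpy()                   # row positions of this trial
        not_alert = trial["IsAlert"].to_numpy() == 0
        prev = np.r_[False, not_alert[:-1]]            # was the previous row not alert?
        nxt = np.r_[not_alert[1:], False]              # is the next row not alert?
        run_start = not_alert & ~prev
        run_end = not_alert & ~nxt
        for i in np.flatnonzero(run_start | run_end):
            lo, hi = max(i - k, 0), min(i + k, len(pos) - 1)
            keep[pos[lo:hi + 1]] = True                # window never leaves the trial
    return df[keep]

# One synthetic trial: alert for 10 rows, not alert for 10 rows, alert again.
demo = pd.DataFrame({
    "TrialID": 0,
    "ObsNum": np.arange(30),
    "IsAlert": [1] * 10 + [0] * 10 + [1] * 10,
})
subset = rows_near_alertness_change(demo)
print(len(subset), subset["ObsNum"].min(), subset["ObsNum"].max())
```

Overlapping windows are merged by the boolean mask, so each row is selected at most once.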
After subsetting, 37421 instances without duplication are chosen to build the predictive model.

5.1.1 FEATURE ENGINEERING

There are 30 features in the dataset, so keeping only the features with higher impact could not only save computational resources but also potentially improve the performance of the predictive model. Principal Component Analysis (PCA) is applied as the feature engineering technique in this case. For PCA, the dataset is first standardized, then the fraction of the variance explained by each component is calculated to identify those with the highest impact on the result.

Figure 1: Scree plot of 30 features

We can see that the first 14 principal components account for 80.95% of the total variance, so the number of features used for modelling is reduced from 30 to 14 in this way.

5.1.2 MODELLING

As the subset is relatively small, stratified 10-fold cross-validation is applied as the evaluation method to make full use of the dataset. Naive Bayes, Logistic Regression, Random Forest, Support Vector Machine, and Neural Network models are built from this dataset.

The Gaussian Naive Bayes model reaches a validation accuracy of 61.74%. Logistic Regression with the 'liblinear' optimization algorithm reaches a validation accuracy of 64.6%. Multiple models in the Support Vector Machine family, including Linear-SVC, Nu-SVC and C-SVC, are applied as well. Their validation accuracies vary from 65% to 78%.

A Neural Network with 5 neurons in the first hidden layer and 2 neurons in the second hidden layer reaches a validation accuracy of 65.01%. I also tried a different architecture: another neural network with 5 hidden layers of 14, 14, 12, 10 and 5 neurons, with the activation function changed from ReLU to logistic (sigmoid). Unfortunately, the performance of the new neural network does not change much.
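The PCA step and the stratified 10-fold cross-validation described in this section can be sketched with scikit-learn as follows. The data here is synthetic (generated with make_classification), the 80% variance threshold is an assumption chosen to mirror the figure quoted above, and logistic regression stands in for any of the classifiers; the numbers it prints are not the project's results.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the 30-feature subset (not the real data).
X, y = make_classification(n_samples=600, n_features=30, n_informative=10,
                           random_state=0)

# Cumulative explained variance of the standardized features (the scree data).
pca = PCA().fit(StandardScaler().fit_transform(X))
cumulative = np.cumsum(pca.explained_variance_ratio_)

# Smallest number of components whose cumulative variance reaches 80%.
n_components = int(np.searchsorted(cumulative, 0.80) + 1)

# Standardize, project, classify; scaling and PCA are refit inside every fold.
model = make_pipeline(
    StandardScaler(),
    PCA(n_components=n_components),
    LogisticRegression(solver="liblinear"),
)
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
print(n_components, scores.mean())
```

Putting the scaler and PCA inside the pipeline keeps the validation fold out of the fitting of both steps, which avoids an optimistic accuracy estimate.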
Model                 Accuracy   Recall    Specificity   AUC Score
Logistic Regression   64.60%     94.32%    25.25%        0.5978
Naive Bayes           61.74%     89.47%    25.02%        0.5724
Random Forest         93.54%     97%       90.08%        0.9354
Linear-SVC            64.55%     94.85%    24.44%        0.5958
Nu-SVC                78.63%     92.36%    60.45%        0.7640
C-SVC                 67.55%     98.13%    27.06%        0.626
Neural Network 1      65.01%     94.31%    26.21%        0.6026
Neural Network 2      65.96%     97.02%    24.83%        0.6092

Performances of algorithms on the PCA dataset

The performance table shows that all the models perform well at predicting drivers who are alert. However, most models cannot reach a decent result when it comes to identifying drivers who are not alert, which is more important in this specific situation. On the other hand, the Random Forest model outperforms all other models, especially in specificity, which makes it a part of our final ensemble model. The Nu-SVC model reaches a specificity of more than 50% as well, which means it can also be part of an ensemble model.

5.2 SECOND MODEL

Unfortunately, we did not get a good predictive machine learning model from the first data pre-processing method (apart from the Random Forest model with 50 trees). I decided to conduct an exploratory analysis on the dataset in order to guide the data pre-processing.

5.2.1 EXPLORATORY ANALYSIS AND FEATURE ENGINEERING ON THE DATASET

We calculate the average Is Alert value per trial and plot the result as a histogram.

Figure 2: Histogram of mean alertness per trial

It is found that most drivers either stay alert or stay not alert throughout a trial of about 1200 measurements. Thus, the characteristics of each driver, captured in the mean and standard deviation of each attribute, can be helpful for predictive analysis. On the other hand, it is impossible to know the mean and standard deviation of a trial at the beginning of that trial, which makes using fixed per-trial means and standard deviations of each feature impractical in a real-life situation.
Moreover, fixed per-trial means and standard deviations cannot capture the change of the driver's behaviour within a trial, which may be changing constantly over time. For these reasons, we decided to use rolling means and standard deviations of each feature as new features, instead of fixed means and standard deviations, in order to make full use of the sequential nature of the data. The rolling window is set to 5: for every 5 instances (500ms), the mean and standard deviation are calculated, then the window drops its first instance, adds a new instance, and so on.

5.2.2 MODELLING

Similarly, we applied the algorithms mentioned above to this pre-processed dataset. As the dataset is big enough, we use an 80%-20% train-test split instead of cross-validation.

Firstly, we tried the Random Forest algorithm, the one that performed best on the previous feature-selected dataset, to see whether there is any improvement compared with the other feature selection method. The Random Forest has 50 trees, and its parameters are the same as those applied before. It reaches a decent performance on the validation dataset, with a validation accuracy of 98.91%.

The algorithms of the Support Vector Machine family all fail to converge within a reasonable period of time. A neural network with four hidden layers of 90, 70, 50 and 30 neurons respectively is also applied and reaches a validation accuracy of 80.76%. Furthermore, Naive Bayes and Logistic Regression have not improved much compared with the previous models.

Generally, Neural Network and Random Forest perform better than the other models in this situation, and Random Forest performs far better than Neural Network.
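The rolling features with a window of 5 can be sketched with pandas as below. The column name TrialID, the suffixes _rmean and _rstd, and the decision to drop the first four rows of each trial (where the window is not yet full) are assumptions of this sketch rather than details confirmed by the project code.

```python
import pandas as pd

def add_rolling_features(df, features, window=5):
    """Append the rolling mean and std of each feature, computed within each trial."""
    out = df.copy()
    grouped = df.groupby("TrialID", sort=False)
    for name in features:
        rolled = grouped[name].rolling(window)
        # The result is indexed by (TrialID, original row); drop the trial
        # level so the values align with the original rows again.
        out[f"{name}_rmean"] = rolled.mean().reset_index(level=0, drop=True)
        out[f"{name}_rstd"] = rolled.std().reset_index(level=0, drop=True)
    # Rows without a full window have no rolling value and are dropped here.
    return out.dropna()

# Two synthetic trials of 8 observations each, with one feature counting 0..7.
demo = pd.DataFrame({
    "TrialID": [0] * 8 + [1] * 8,
    "V1": list(range(8)) * 2,
})
result = add_rolling_features(demo, ["V1"], window=5)
print(result.head(2))
```

The window is trailing, so each row only uses the current and the four preceding observations of the same trial, which is what makes these features usable while a trial is still running.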
Model                 Accuracy   Recall    Specificity   AUC Score
Logistic Regression   61.21%     75.21%    41.96%        0.5858
Naive Bayes           62.86%     45.28%    87.02%        0.6615
Random Forest         98.91%     98.58%    97.55%        0.9873
Neural Network        80.76%     96.40%    59.26%        0.7783

Performances of algorithms on the rolling mean and standard deviation dataset

5.2.3 COMPARISON OF MODELS

Comparing the performances of models trained on data pre-processed by different methods, it is found that Logistic Regression, Support Vector Machine, and Naive Bayes are not suitable for this problem. While the Neural Network can reach a good performance on the dataset pre-processed by generating time-sequential features, it is not the model that fits the dataset best. The Random Forest algorithm generates the best result, whether trained on data pre-processed by PCA or on data pre-processed by other feature engineering techniques.

Another interesting finding is that most models perform better at predicting 'alert' drivers than at predicting 'not alert' drivers, apart from the Naive Bayes model. Considering that the two values are roughly equally distributed (alert: 349785, not alert: 254544), it is hard to say that one label is over-represented, which makes the unbalanced prediction results hard to explain.

As a result, I choose three models for local evaluation: two Random Forest models and a Neural Network model.

6 LOCAL EVALUATION

We use the data 'solution.csv' to evaluate the final models.

6.1 MODEL 1

The first model is the Random Forest trained on the data with features selected by PCA.

             Predict = 0   Predict = 1
Actual = 0   22571         7343
Actual = 1   63616         27310

AUC = 0.52744

Though this model only reaches an accuracy of 41.28% on the test dataset, it identifies many not alert drivers correctly. Overall, this model is not good enough, whichever measure is used to evaluate it.

6.2 MODEL 2

The second model is the Random Forest trained on the data with added features of rolling mean and standard deviation.
             Predict = 0   Predict = 1
Actual = 0   16671         13243
Actual = 1   8679          82247

AUC = 0.7309

This model reaches a good performance, with a classification accuracy of 81.86% on the test data. It performs well at predicting alert drivers, with recall = 90.45%, precision = 86.13% and F1-score = 88.24%. However, in this specific situation the model is expected to identify 'not alert' drivers precisely, so specificity (the percentage of Actual = 0 instances predicted correctly) should be the evaluation measure we focus on. The specificity of this model only reaches 55.73%, which still leaves a lot of room for improvement. Overall, the AUC value of this model is 0.7309, not as good as the work I referenced, but still an improvement.

6.3 MODEL 3

The third model is the Neural Network trained on the data with added features of rolling mean and standard deviation. It has four hidden layers with 90, 70, 50 and 30 neurons. We intended the first layer to take in all the features and the following layers to process them and predict the output; thus the first layer has as many neurons as there are features.

             Predict = 0   Predict = 1
Actual = 0   12886         17028
Actual = 1   361           90565

AUC = 0.71340

This model reaches a good performance as well, compared with the first model. It successfully predicts most of the alert drivers (recall = 99.6%, precision = 84.17%, F1-score = 91.24%). However, the model fails to predict many not alert drivers correctly (specificity = 43.08%), which is the more important evaluation measure for this predictive model.

6.4 COMPARISON OF MODEL 2 AND MODEL 3

Both models perform better at predicting alert drivers than at identifying not alert drivers, as their true positive rates are both higher than their true negative rates in their confusion matrices, although the main goal of this predictive model is to identify not alert drivers.
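The figures quoted for Models 2 and 3 follow directly from their confusion matrices. The short sketch below recomputes them, treating 'alert' (1) as the positive class; the function name is only illustrative.

```python
def binary_metrics(tn, fp, fn, tp):
    """Metrics from a 2x2 confusion matrix, with 'alert' (1) as the positive class."""
    recall = tp / (tp + fn)        # share of alert rows predicted as alert
    precision = tp / (tp + fp)     # share of alert predictions that are right
    return {
        "accuracy": (tp + tn) / (tn + fp + fn + tp),
        "recall": recall,
        "precision": precision,
        "f1": 2 * precision * recall / (precision + recall),
        "specificity": tn / (tn + fp),   # share of not alert rows identified
    }

# Counts copied from the confusion matrices (rows: actual, columns: predicted).
model_2 = binary_metrics(tn=16671, fp=13243, fn=8679, tp=82247)   # Random Forest
model_3 = binary_metrics(tn=12886, fp=17028, fn=361, tp=90565)    # Neural Network

for name, metrics in [("Model 2", model_2), ("Model 3", model_3)]:
    print(name, {key: round(100 * value, 2) for key, value in metrics.items()})
```

For both models the recall is well above the specificity, which is the imbalance discussed in this section.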
In the ROC curves of the two models (Figure 3), the curve that reaches a 100% true positive rate first is the neural network; the other curve is the random forest. It can also be seen that the random forest model performs better than the neural network model. However, the neural network predicts most alert drivers correctly, and when it predicts a driver as not alert, it is correct most of the time. The random forest model identifies a significantly larger share of the not alert drivers than the neural network model, though when it predicts a driver as not alert, it has about a 34.2% chance of being wrong (8679 of its 25350 'not alert' predictions). The Random Forest model would be the better choice in this situation, but the architecture of the neural network model can be optimized to reach a higher performance.

7 RESULT REFLECTION AND COMPARISON

7.1 RESULT CONCLUSION

This project was meant to build a supervised learning model to identify not alert drivers. The best-performing model is a Random Forest with 50 trees. It predicts 16671 of 29914 not alert drivers correctly in the test data, and reaches a classification accuracy of 81.86% and an AUC value of 0.7309.

7.2 RESULT COMPARISON

Compared with the results on the leaderboard, many participants' models reach a higher performance. The best model reaches an AUC value of 0.86115, though it uses the means and standard deviations of each trial as new features. Almost 20 participants' models reach AUC scores over 0.8, which is significantly higher than mine. This indicates that there is still a lot of room to improve my model.

7.3 DISCUSSION AND FUTURE WORK

The existing predictive model for this problem is far from perfect. There are a few directions that could improve the performance of the model.

7.3.1 DATA PREPROCESSING METHOD

Rolling means and standard deviations have proven to be a good method for preserving the sequential nature of the data. However, a rolling window of 0.5s could be too short to capture the driving pattern of a driver.
Expanding the rolling window to produce rolling means and standard deviations over a longer period could be a useful way to introduce the long-term driving patterns of drivers. Better performance can likely be achieved by a model that learns not only from drivers' behaviour over a short time (0.5s) but over a longer time as well.

7.3.2 MODEL OPTIMIZATION AND SELECTION

Though the neural network model fails to produce a better performance than the random forest model, it is still not convincing that random forest is always the best option for this problem. The neural network still shows great potential to produce good results. Further work could optimize the architecture of the neural network.

REFERENCE LINK

• https://www.kaggle.com/c/stayalert/data

PROJECT LINK

https://github.com/bindhu520/Safe-driving-Challenge-ML-PROECT-