This repository contains code developed during the "Applied AI" course at SRH Heidelberg in November/December 2017.
This project requires the following packages in your Python environment:
scipy
numpy
matplotlib
pandas
scikit-learn (imported as sklearn)
seaborn
All these packages can be installed using pip.
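Before running the script, it can help to check that all packages are importable. The following is a minimal sketch that uses only the standard library and assumes Python 3; the helper name missing_packages is hypothetical and not part of this repository.

```python
import importlib.util

# Import names of the packages listed above (scikit-learn is imported as "sklearn").
REQUIRED = ["scipy", "numpy", "matplotlib", "pandas", "sklearn", "seaborn"]


def missing_packages(names):
    """Return the names that cannot be found in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages(REQUIRED)
    if missing:
        print("Missing packages: " + ", ".join(missing))
    else:
        print("All required packages are available.")
```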
The objective of this project is to analyze the Adult Dataset hosted by the UCI Machine Learning Repository.
The main Python script can be found in the folder AdultDataset. Running it will analyze the dataset.
The script was written to be as generic as possible, so that it can be used with other datasets with minimal code adjustments.
To use the code with a different dataset, adjust the functions main and load_data as follows.
main:
Comment out the following lines:
dataset_flt = dataset[dataset["workclass"] != "?"]
dataset_flt = dataset_flt[dataset_flt["occupation"] != "?"]
dataset_flt = dataset_flt[dataset_flt["native-country"] != "?"]
del dataset_flt['education']
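The lines above drop rows in which the Adult Dataset marks a missing value with "?" and remove the education column. If your own dataset uses a similar placeholder, a generic variant might look like the sketch below. The helper drop_placeholder_rows and the sample data are hypothetical and not part of the script.

```python
import pandas


def drop_placeholder_rows(dataset, columns, placeholder="?"):
    """Keep only rows where none of the given columns equals the placeholder."""
    mask = (dataset[columns] != placeholder).all(axis=1)
    return dataset[mask]


# Tiny made-up sample, only for illustration.
sample = pandas.DataFrame({
    "workclass": ["Private", "?", "State-gov"],
    "occupation": ["Sales", "Tech-support", "?"],
    "age": [39, 50, 38],
})
dataset_flt = drop_placeholder_rows(sample, ["workclass", "occupation"])
print(dataset_flt)
```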
load_data:
Adjust the following lines to whatever is needed to load your data into a DataFrame as delivered by pandas:
headerNames = ['age', 'workclass', 'fnlwgt', 'education', 'education-num', 'marital-status', 'occupation', 'relationship', 'race', 'sex', 'capital-gain', 'capital-loss', 'hours-per-week', 'native-country', 'class']
print('Downloading Data...')
dataset = pandas.read_csv('adult.data', header=None, sep=', ', names=headerNames)
print("Download complete. The fnlwgt feature is not needed for our prediction. Let us delete it.")
del dataset['fnlwgt']
del headerNames[2]
Please note that it is important that the values to be predicted are in a column called "class".
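As an illustration, a replacement for load_data could read a CSV file that has a header row and rename the target column to "class". This is only a sketch: the function signature, the column names, and the inline sample data are assumptions and do not come from the script.

```python
import io

import pandas

# Made-up CSV content standing in for your own data file.
RAW = io.StringIO(
    "sepal_length,sepal_width,species\n"
    "5.1,3.5,setosa\n"
    "6.2,2.9,versicolor\n"
)


def load_data(source, target_column):
    """Load a CSV into a DataFrame and rename the target column to "class"."""
    dataset = pandas.read_csv(source)
    return dataset.rename(columns={target_column: "class"})


dataset = load_data(RAW, target_column="species")
print(list(dataset.columns))
```

In practice, source would be the path to your data file instead of the in-memory sample.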
If a dataset with non-binary classes is used, a slight adjustment is also needed in the function algorithmTrialDefault:
#results.append(cv_results) should be uncommented
results = pandas.DataFrame(results) should be commented out
print (results.sort_values(by='test_roc_auc', ascending=False)) should be commented out
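The column name test_roc_auc suggests that the results table is sorted by ROC AUC, a score that in its plain form is defined for binary targets. The sketch below shows how raw cross-validation results could be collected for a multi-class dataset instead. It assumes scikit-learn's cross_validate and uses accuracy as a stand-in metric; the dataset, the classifier, and the metric are illustrative choices, not the script's actual configuration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_validate
from sklearn.neighbors import KNeighborsClassifier

# Iris has three classes and serves only as a stand-in dataset.
X, y = load_iris(return_X_y=True)

# "accuracy" is an assumed replacement metric that also works for more than two classes.
cv_results = cross_validate(KNeighborsClassifier(), X, y, cv=5, scoring=["accuracy"])

results = []
results.append(cv_results)  # keep the raw per-fold results instead of building a table
print(cv_results["test_accuracy"].mean())
```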
This script was tested on Python 2.7 and Python 3.6.
It worked in both environments, but the performance was slightly better with 3.6.
Each function of the script asks the user whether it should be executed.
These prompts can be skipped by answering "y" when asked whether the script should run headless.
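The prompting behavior can be sketched as follows, assuming that headless mode runs every step without asking. The names confirm and run_steps are hypothetical, and the code targets Python 3 (Python 2.7 would need raw_input instead of input).

```python
def confirm(question, headless=False, ask=input):
    """Return True if the step should run.

    In headless mode no question is asked and every step runs.
    """
    if headless:
        return True
    answer = ask(question + " (y/n) ")
    return answer.strip().lower() == "y"


def run_steps(steps, headless, ask=input):
    """Run each (name, function) pair that the user confirms."""
    executed = []
    for name, function in steps:
        if confirm("Execute step '" + name + "'?", headless, ask):
            function()
            executed.append(name)
    return executed
```

Passing the ask function as a parameter keeps the helpers testable without real keyboard input.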
The following steps will be executed:
- Loading the data into a DataFrame.
- Generating a graphical overview of the data and exporting it to PNG files.
- Generating heatmaps of feature correlations and exporting them to PNG files.
- Trying different machine learning algorithms with their default parameters.
- Executing a parameter optimization for the Logistic Regression Algorithm.
- Executing a parameter optimization for the KNN Algorithm.
- Executing a parameter optimization for the SVM Algorithm. Please expect a very long runtime (easily 10 hours) here.
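A parameter optimization such as the one for KNN can be sketched with scikit-learn's GridSearchCV. This is a minimal example on a small stand-in dataset; the parameter grid, the scoring metric, and the number of folds are assumptions and may differ from what the script uses.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Small stand-in dataset so the search finishes quickly.
X, y = load_iris(return_X_y=True)

# Assumed, deliberately small grid; the script may search other values.
param_grid = {"n_neighbors": [3, 5, 7], "weights": ["uniform", "distance"]}

# Every parameter combination is evaluated with 5-fold cross-validation.
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5, scoring="accuracy")
search.fit(X, y)
print(search.best_params_, search.best_score_)
```

The number of model fits grows with the product of the grid sizes and the number of folds, which helps explain why a search over a slower algorithm such as SVM on the full dataset can take hours.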
An analysis of the Adult Dataset can be found in the file "Documentation.docx", which describes my work during the Applied AI course.