Implementations of both K-Nearest Neighbour and Naive Bayes classifiers in Python, along with some sample data. Some references to Weka are made; it can be downloaded here:
1. Raw Data - Contains the raw Pima Indians diabetes data file as well as the assignment spec.
2. Processing - Contains everything needed to generate the processed file:
- Run the processing script to add a header, change the class names and remove invalid results (see Assumptions and Invalid Data).
- Open the resulting file in Weka, go to Filter > Choose > Attribute > Normalise > Apply, and then save the file as pima.csv to get a normalised CSV file.
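For readers without Weka to hand, the normalisation step can be approximated in plain Python. This is a minimal sketch, assuming the Weka filter rescales each numeric attribute to the range [0, 1] (min-max scaling); the function name `min_max_normalise` and the sample values are made up for illustration and are not part of this repository.

```python
def min_max_normalise(rows):
    """Scale every numeric column of `rows` into the range [0, 1]."""
    columns = list(zip(*rows))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    normalised = []
    for row in rows:
        normalised.append([
            # A constant column has no spread, so map it to 0.0.
            (value - low) / (high - low) if high > low else 0.0
            for value, low, high in zip(row, lows, highs)
        ])
    return normalised


# Made-up values, not taken from the real data set.
sample = [[1.0, 50.0], [3.0, 100.0], [5.0, 75.0]]
scaled = min_max_normalise(sample)
for row in scaled:
    print(row)
```

The class column should be left out of `rows` before scaling, since only the numeric attributes are meant to be normalised.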
3. Classifiers - Contains Python scripts to run the classifiers:
- Run the script with -h for more information about argument usage.
- A log file will be created which logs information about the run, such as the number of correctly and incorrectly identified instances.
- The arguments are:
  - --folds, the number of folds you want to split the data into (default 10)
  - --neighbours, the number of nearest neighbours (default 3)
  - --algorithm, the algorithm to run (KNN/NB)
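To show how these arguments fit together, here is a rough sketch of a command line that runs either classifier under k-fold cross-validation. It is not the code in this repository: the function names, the unstratified fold split, the Gaussian form of Naive Bayes, the small variance floor and the toy data are all assumptions made for illustration. Only the three flags and their defaults come from the description above.

```python
import argparse
import math


def build_parser():
    parser = argparse.ArgumentParser(description="Run KNN or Naive Bayes with cross-validation.")
    parser.add_argument("--folds", type=int, default=10, help="number of folds (default 10)")
    parser.add_argument("--neighbours", type=int, default=3, help="nearest neighbours (default 3)")
    parser.add_argument("--algorithm", choices=["KNN", "NB"], required=True, help="algorithm to run")
    return parser


def knn_predict(train, query, k):
    """Majority label among the k training rows closest to `query` (Euclidean)."""
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = {}
    for _, label in nearest:
        votes[label] = votes.get(label, 0) + 1
    return max(votes, key=votes.get)


def nb_predict(train, query):
    """Gaussian Naive Bayes: pick the class with the highest log posterior."""
    by_class = {}
    for features, label in train:
        by_class.setdefault(label, []).append(features)
    best_label, best_score = None, -math.inf
    for label, rows in by_class.items():
        score = math.log(len(rows) / len(train))  # log prior
        for column, value in zip(zip(*rows), query):
            mean = sum(column) / len(column)
            # Assumed variance floor to avoid dividing by zero.
            var = sum((x - mean) ** 2 for x in column) / len(column) + 1e-9
            score += -0.5 * math.log(2 * math.pi * var) - (value - mean) ** 2 / (2 * var)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def cross_validate(data, folds, predict):
    """Return (correct, incorrect) counts over a simple, unstratified k-fold split."""
    correct = incorrect = 0
    for fold in range(folds):
        test = data[fold::folds]  # every row whose index % folds == fold
        train = [item for i, item in enumerate(data) if i % folds != fold]
        for features, label in test:
            if predict(train, features) == label:
                correct += 1
            else:
                incorrect += 1
    return correct, incorrect


# Toy, already-normalised rows (made up). An explicit argument list is passed
# so the sketch runs anywhere; call parse_args() with no list to read sys.argv.
data = [([0.1, 0.2], "no"), ([0.2, 0.1], "no"), ([0.9, 0.8], "yes"), ([0.8, 0.9], "yes"),
        ([0.15, 0.25], "no"), ([0.05, 0.15], "no"), ([0.85, 0.75], "yes"), ([0.95, 0.85], "yes")]
args = build_parser().parse_args(["--algorithm", "KNN", "--folds", "2"])
if args.algorithm == "KNN":
    predict = lambda train, query: knn_predict(train, query, args.neighbours)
else:
    predict = nb_predict
correct, incorrect = cross_validate(data, args.folds, predict)
print("correct:", correct, "incorrect:", incorrect)
```

The two counts correspond to the kind of information the log file is described as recording.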
4. Feature Selection - Contains a version of the data that has been run through Weka's CFS feature selection:
- Open the file generated in step 2 in Weka and go to Select Attributes > Start > Right click the result > Save Reduced data...
- Some header information that Weka puts in the file had to be removed manually.
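As background on what CFS (correlation-based feature selection) optimises, the sketch below computes the CFS merit of a feature subset: k * r_cf / sqrt(k + k(k-1) * r_ff), where r_cf is the mean feature-class correlation and r_ff the mean feature-feature correlation. It is only an illustration and not Weka's implementation: it assumes absolute Pearson correlation as the correlation measure and a numerically coded class, and the names `pearson` and `cfs_merit` are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric lists (0.0 if either is constant)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def cfs_merit(features, labels):
    """Merit of a subset; `features` is a list of columns, `labels` numeric classes."""
    k = len(features)
    r_cf = sum(abs(pearson(f, labels)) for f in features) / k
    if k == 1:
        return r_cf
    pairs = [(i, j) for i in range(k) for j in range(i + 1, k)]
    r_ff = sum(abs(pearson(features[i], features[j])) for i, j in pairs) / len(pairs)
    return k * r_cf / math.sqrt(k + k * (k - 1) * r_ff)


# Made-up columns: `a` tracks the class, `c` is unrelated to both.
labels = [0, 0, 1, 1]
a = [1, 2, 3, 4]
c = [1, -1, -1, 1]
print(cfs_merit([a], labels))
print(cfs_merit([a, a], labels))  # a redundant copy adds nothing
print(cfs_merit([a, c], labels))  # an irrelevant feature lowers the merit
```

A subset scores well when its features correlate with the class but not with each other, which is why redundant or irrelevant attributes tend to be dropped.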
5. Results - Contains a results spreadsheet and the final report.
- Compares our classifiers to Weka's for both the data with no feature selection and the data with feature selection (this is also included in report.pdf).
- Contains findings.
Assumptions and Invalid Data
- There are a number of fields in the data where missing attributes have been coded as 0. We decided to remove the rows containing a 0 value in the following fields:
  - Glucose Concentration
  - Blood Pressure
  - Triceps Skin Fold Thickness
  - Body Mass Index (BMI)
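The row filter described above could look roughly like the following. This is a sketch under stated assumptions rather than the processing script itself: the header names are hypothetical, the column order is assumed to follow the standard Pima Indians diabetes data set, and the sample rows are made up.

```python
import csv
import io

# Hypothetical header names; the column order is assumed to match the
# standard Pima Indians diabetes data set.
HEADER = ["pregnancies", "glucose", "blood_pressure", "triceps_skin",
          "insulin", "bmi", "pedigree", "age", "class"]

# Fields where a 0 is treated as a missing value.
REQUIRED_NONZERO = ["glucose", "blood_pressure", "triceps_skin", "bmi"]


def remove_invalid_rows(raw_lines):
    """Return (header, rows), dropping rows where a required field is 0."""
    kept = []
    for record in csv.reader(raw_lines):
        if not record:
            continue  # skip blank lines
        row = dict(zip(HEADER, record))
        if any(float(row[name]) == 0 for name in REQUIRED_NONZERO):
            continue
        kept.append(record)
    return HEADER, kept


# Made-up rows for illustration only.
raw = io.StringIO(
    "2,120,70,30,0,30.0,0.5,40,1\n"    # kept: a 0 in insulin is allowed
    "1,0,70,30,80,25.0,0.3,30,0\n"     # dropped: glucose is 0
    "3,110,0,25,90,28.0,0.4,35,1\n"    # dropped: blood pressure is 0
    "0,100,60,0,70,22.0,0.2,25,0\n"    # dropped: triceps skin is 0
    "4,130,80,20,100,0,0.6,45,1\n"     # dropped: BMI is 0
    "5,140,75,28,0,31.5,0.7,50,0\n"    # kept
)
header, rows = remove_invalid_rows(raw)
print(header)
for row in rows:
    print(row)
```

Zeros in other columns, such as the number of pregnancies, are left alone because only the four listed fields are treated as invalid when 0.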