This repository has been archived by the owner on Sep 21, 2021. It is now read-only.

Predicting_Covid19_Test_Result Deliverable 0, 1, 2, 3, 4, & Final Report #179

Open
wants to merge 12 commits into
base: master
12 changes: 12 additions & 0 deletions Predicting_COVID-19_Test_Result/README.md
@@ -0,0 +1,12 @@
# Predicting COVID-19 Test Result from Symptoms and Comorbidities

Group Members: Runsheng Wang & Jinqi Lu
runsheng@bu.edu, jinqilu@bu.edu

# Project Description
This study uses machine learning to predict Covid-19 diagnosis from clinical data. The data was collected, aggregated, de-identified, and published by Carbon Health, a U.S. primary and urgent care provider. The dataset contains 46 variables, including epidemiological factors, comorbidities, vitals, clinician-assessed and patient-reported symptoms, as well as lab results and radiological findings.

The dataset was first cleaned and encoded, then split into a training set and a testing set. Exploratory Data Analysis was performed on the training set to gain a basic understanding of the input variables. Feature selection was then performed by ensembling the Chi-Squared and Mutual Information criteria. Six features, namely Cough, Fever, Headache, Loss of Smell, Loss of Taste, and Muscle Sore, were ultimately used for model construction.

Because the class distribution of the response variable is extremely imbalanced (99:1 ratio), resampling methods were applied to the training set to reduce classification bias towards the majority class. Since the dataset consists solely of categorical features, SMOTE-N, a categorical variation of the Synthetic Minority Oversampling Technique (SMOTE), was applied to the training set to achieve a negative-to-positive class ratio of 5:1. Undersampling was then performed using One Sided Selection and Neighborhood Cleaning Rule to further balance the class distribution.

To identify the best model for this dataset, six classifiers, namely kNN, Logistic Regression, Decision Tree, Random Forest, Categorical Naive Bayes, and XGBoost, were fitted and their baseline performances were compared. To tune the hyperparameters of the Random Forest Classifier, Random Search CV was run for 2000 iterations, with 4-fold CV repeated three times on each iteration. The best hyperparameters were recorded locally, and Grid Search CV was then applied to refine them by checking the immediate neighbors of each hyperparameter value.

The model predicts the testing set quite well, and its performance is substantially better than the baseline. However, the model underperforms at predicting positive results, which is to be expected given the severe class imbalance in the original dataset. Our model could be applied to prioritize testing when testing resources are scarce or inaccessible. It could also aid the diagnosis of Covid-19 in ambiguous or conflicting cases.
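
The feature-selection step described above combines two criteria, but the README does not say how they were combined. The sketch below is one possible reading, not the project's actual code: it scores each categorical feature with a contingency-table Chi-Squared statistic and with Mutual Information, then averages the two ranks. The helper names (`chi2_stat`, `mutual_info`, `ensemble_rank`), the rank-averaging rule, and the toy data are all assumptions made for illustration.

```python
import math
from collections import Counter


def chi2_stat(x, y):
    """Pearson chi-squared statistic of the contingency table of x vs y."""
    n = len(x)
    joint = Counter(zip(x, y))
    px, py = Counter(x), Counter(y)
    stat = 0.0
    for a in px:
        for b in py:
            expected = px[a] * py[b] / n
            observed = joint.get((a, b), 0)
            stat += (observed - expected) ** 2 / expected
    return stat


def mutual_info(x, y):
    """Mutual information (in nats) between two categorical sequences."""
    n = len(x)
    joint = Counter(zip(x, y))
    px, py = Counter(x), Counter(y)
    return sum(
        (c / n) * math.log(c * n / (px[a] * py[b]))
        for (a, b), c in joint.items()
    )


def rank(scores):
    """Map each feature name to its rank, where 1 is the highest score."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {name: i + 1 for i, name in enumerate(ordered)}


def ensemble_rank(features, y):
    """Order features by the mean of their chi-squared and MI ranks (assumed rule)."""
    chi_rank = rank({k: chi2_stat(v, y) for k, v in features.items()})
    mi_rank = rank({k: mutual_info(v, y) for k, v in features.items()})
    return sorted(features, key=lambda k: (chi_rank[k] + mi_rank[k]) / 2)


# Toy data, invented for illustration only.
y = [1, 1, 1, 0, 0, 0, 0, 0]
features = {
    "noise": [1, 0, 1, 0, 1, 0, 1, 0],  # weakly related to y
    "cough": [1, 1, 1, 0, 0, 0, 0, 0],  # identical to y
}
ordering = ensemble_rank(features, y)
```

Averaging ranks rather than raw scores avoids having to put the two criteria on a common scale, which is why it is used in this sketch. Note that the statistic here is computed from the full contingency table and may differ from what a particular library's Chi-Squared scorer returns.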

110 changes: 110 additions & 0 deletions Predicting_COVID-19_Test_Result/code/dataplot.py
@@ -0,0 +1,110 @@
"""
Generate plot for project data

Project Started: April.25.2021
Project Concluded: April.28.2021
"""

###Import Lib
import pandas as pd
import matplotlib.pyplot as plt
import progressbar

###Global Settings
datadir = "E:/Personal Sync/Academic/Spring 2021/CS 506/Homework/project/covid-project/data/"
outdir = "E:/Personal Sync/Academic/Spring 2021/CS 506/Homework/project/covid-project/output/"

##Uncomment and comment to switch dataset
#dfname = "raw_concatenated.csv"
dfname = "israeli/corona_tested_individuals_ver_006.english.csv"

#round to X decimal for float
roundn = 6

#whitelist some values to avoid type conversion
wlist = ["covid19_test_results", "swab_type", "age"]


######Begin Functions
###Load data from file
def getdata(dataloc):
    return pd.read_csv(dataloc)

#print the number of rows that contain at least one nan
def getnan(rdata):
    nan = rdata.isna().any(axis=1).sum()
    print(nan)

#write a specific file to a system location with content
def writefile(filename, content):
    lenf1 = len(content)
    bar = progressbar.ProgressBar(maxval=lenf1,
        widgets=[progressbar.Bar('=', '[', ']'), ' ', progressbar.Percentage()])
    bar.start()
    #the with-block closes the file even if a write fails
    with open(filename, "w") as res:
        for cnt, i in enumerate(content, start=1):
            res.write(str(i))
            res.write("\n")
            bar.update(cnt)
    bar.finish()

#get histo plot for specific column
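def get_histo(rdata, cname):
    testr = rdata[cname]
    #whitelisted columns keep their dtype; every other column is plotted as strings
    if cname not in wlist:
        #convert only the requested column instead of the whole dataframe
        testr = testr.astype(str)
    plt.hist(testr)
    plt.title(cname)
    plt.ylabel('Number of Cases')
    plt.savefig(outdir + cname + "_histogram.png")
    plt.show()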

#get bar chart
def get_bar(rdata, cname):
    #plt.bar needs both x positions and heights, so count each distinct value first
    counts = rdata[cname].astype(str).value_counts()
    plt.bar(counts.index, counts.values)
    plt.title(cname)
    plt.ylabel('Number of Cases')
    plt.savefig(outdir + cname + "_barchart.png")
    plt.show()

#get number of nan in a series
def get_nan(rdata):
    return rdata.isna().sum()

#main function
def main():
    dloc = datadir + dfname
    rawd = getdata(dloc)
    #getnan(rawd)

    #get histogram
    for col in rawd.columns:
        get_histo(rawd, col)

    #count nan values
    res = []
    #get row length
    rlen = len(rawd.index)
    res.append("Number of NaN values and its percentage in each column")
    for col in rawd.columns:
        n = get_nan(rawd[col])
        #get percentage
        perc = (n/rlen)*100
        r = col + " : " + str(n) + " Percentage: " + str(round(perc, roundn)) + "%"
        res.append(r)
    #write to file
    writefile(outdir + "number_of_nan.txt", res)

#invoke the main function
if __name__ == "__main__":
    main()
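
The NaN report built inside `main()` could be factored into a pure function so that it can be checked without reading a CSV. The following is a minimal sketch under that assumption; the helper name `nan_report_lines` is hypothetical and not part of this PR, and the zero-row guard is an addition the original loop does not have.

```python
def nan_report_lines(nan_counts, n_rows, ndigits=6):
    """Build the report lines from a {column: NaN count} mapping.

    In dataplot.py the mapping could come from rawd.isna().sum().to_dict();
    here it is passed in directly so the function has no pandas dependency.
    """
    lines = ["Number of NaN values and its percentage in each column"]
    for col, n in nan_counts.items():
        # Guard against an empty dataset (assumed behaviour: report 0.0%).
        perc = (n / n_rows) * 100 if n_rows else 0.0
        lines.append(
            col + " : " + str(n) + " Percentage: " + str(round(perc, ndigits)) + "%"
        )
    return lines


# Toy counts, invented for illustration only.
example = nan_report_lines({"age": 1, "cough": 0}, n_rows=4)
```

The returned list has the same layout as `res` in `main()`, so it could be passed to `writefile` unchanged.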
36 changes: 36 additions & 0 deletions Predicting_COVID-19_Test_Result/code/dict_to_txt.py
@@ -0,0 +1,36 @@
# -*- coding: utf-8 -*-
"""
Created on Sat Apr 24 00:09:42 2021

@author: KevinLu
"""
from datetime import datetime

d = {'x': 1, 'y': "test", 'z': 3}


wdir = "./output/dict_log.txt"

def dict_to_txt(payload, title, wodir = wdir):
    def add_txt_to_file(filename, content):
        #append mode keeps earlier reports; the with-block closes the file on error
        with open(filename, "a") as res:
            for i in content:
                res.write(str(i))
                res.write("\n")
    #generate payload list
    res = []
    res.append("##########" + title + "##########")
    t1 = str(datetime.now())
    res.append("Report Created: " + t1)
    res.append("\n")

    #loop through the dict
    for key in payload:
        stro = str(key) + " : " + str(payload[key])
        res.append(stro)
    res.append("#"*(len(title)+20))
    #write file
    add_txt_to_file(wodir, res)

#run the demo only when executed directly, not when imported
if __name__ == "__main__":
    dict_to_txt(d, "this is a test")
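
`dict_to_txt` opens `./output/dict_log.txt` in append mode, which fails if the `output` directory does not exist yet. The sketch below shows one way to separate building the report from writing it, and to create the directory on demand. The names `build_report` and `append_report` are hypothetical, and the temporary directory is used only so the example does not touch the project tree.

```python
import tempfile
from datetime import datetime
from pathlib import Path


def build_report(payload, title, now=None):
    """Return the report lines in the same layout dict_to_txt writes."""
    stamp = str(now or datetime.now())
    lines = [
        "##########" + title + "##########",
        "Report Created: " + stamp,
        "\n",
    ]
    lines += [str(key) + " : " + str(value) for key, value in payload.items()]
    lines.append("#" * (len(title) + 20))
    return lines


def append_report(path, lines):
    """Append lines to path, creating the parent directory if needed."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("a") as fh:
        for line in lines:
            fh.write(str(line) + "\n")


# Write into a throwaway directory and read the result back.
with tempfile.TemporaryDirectory() as tmp:
    log = Path(tmp) / "output" / "dict_log.txt"
    append_report(log, build_report({"x": 1, "y": "test"}, "demo"))
    written = log.read_text().splitlines()
```

Keeping `build_report` free of file access means the layout can be checked directly, while `append_report` is the only part that depends on the filesystem.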