BU-Spark · KevinLu2000 · Mar 1, 2021 · Apr 28, 2021 · Apr 28, 2021 · Apr 28, 2021
diff --git a/Predicting_COVID-19_Test_Result/README.md b/Predicting_COVID-19_Test_Result/README.md
@@ -0,0 +1,12 @@
+# Predicting COVID-19 Test Result from Symptoms and Comorbidities
+
+Group Members: Runsheng Wang & Jinqi Lu
+runsheng@bu.edu, jinqilu@bu.edu
+
+# Project Description
+This study uses machine learning to predict COVID-19 diagnosis from a range of clinical data. The data was collected, aggregated, de-identified, and published by Carbon Health, a U.S. primary and urgent care provider. The dataset contains a total of 46 variables, including epidemiological factors, comorbidities, vitals, clinician-assessed and patient-reported symptoms, as well as lab results and radiological findings. The dataset was first cleaned and encoded, then split into a training set and a testing set. Exploratory Data Analysis was performed on the training set to gain a basic understanding of the characteristics of the input variables. Feature selection was then performed by ensembling the Chi-Squared and Mutual Information criteria. Six features, namely Cough, Fever, Headache, Loss of Smell, Loss of Taste, and Muscle Sore, were ultimately used for model construction.
+
+Due to the extreme imbalance (99:1 ratio) in the class distribution of the response variable, resampling methods were applied to the training set to reduce classification bias towards the majority class. Since the dataset consists solely of categorical features, SMOTE-N, a categorical variant of the Synthetic Minority Oversampling Technique (SMOTE), was applied to the training set to achieve a negative-to-positive class ratio of 5:1. Undersampling was then performed using One-Sided Selection and the Neighborhood Cleaning Rule to further balance the class distribution of the response variable. To identify the best model for this dataset, six classifiers, namely kNN, Logistic Regression, Decision Tree, Random Forest, Categorical Naive Bayes, and XGBoost, were fitted and their baseline performances compared. To tune hyperparameters, Random Search CV was run for 2000 iterations on the Random Forest classifier, with 4-fold CV repeated three times on each iteration. The best hyperparameters were recorded locally, and Grid Search CV was then applied to tune them further by checking the immediate neighbors of each hyperparameter value.
+
+The model predicts the testing set well overall and performs substantially better than the baseline. However, it underperforms at predicting positive results, which is to be expected given the severe class imbalance in the original dataset. Our model could be applied to prioritize testing when testing resources are scarce or inaccessible. It could also aid the diagnosis of COVID-19 in ambiguous or conflicting cases.
+
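Review note on the README: the feature-selection step ("ensembling the Chi-Squared and Mutual Information criteria") is described but not shown in this diff. Below is a minimal sketch of one way it could work, assuming the ensemble is a plain average of the two rankings; that is an assumption, not something the source confirms. The helper names `chi2_stat`, `mutual_info`, `ranks`, and `ensemble_rank` and the toy columns are hypothetical. The statistic is the Pearson chi-squared of the contingency table, which may differ from the library routine the authors actually used.

```python
import math
from collections import Counter


def chi2_stat(x, y):
    """Pearson chi-squared statistic of the contingency table of x against y."""
    n = len(x)
    joint, px, py = Counter(zip(x, y)), Counter(x), Counter(y)
    stat = 0.0
    for a in px:
        for b in py:
            expected = px[a] * py[b] / n
            stat += (joint.get((a, b), 0) - expected) ** 2 / expected
    return stat


def mutual_info(x, y):
    """Mutual information (in nats) between two categorical sequences."""
    n = len(x)
    joint, px, py = Counter(zip(x, y)), Counter(x), Counter(y)
    return sum((c / n) * math.log(c * n / (px[a] * py[b]))
               for (a, b), c in joint.items())


def ranks(scores):
    """Map each feature name to its rank (0 = highest score)."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {name: i for i, name in enumerate(ordered)}


def ensemble_rank(features, y):
    """Order feature names by the mean of their chi-squared and MI ranks."""
    r_chi = ranks({k: chi2_stat(v, y) for k, v in features.items()})
    r_mi = ranks({k: mutual_info(v, y) for k, v in features.items()})
    return sorted(features, key=lambda k: (r_chi[k] + r_mi[k]) / 2)


# Toy, made-up data (not from the project): 1 = symptom present / positive test.
toy_y = [1, 1, 1, 1, 0, 0, 0, 0]
toy_features = {
    "noise": [1, 0, 1, 0, 1, 0, 1, 0],  # independent of the label
    "fever": [1, 1, 1, 0, 0, 0, 0, 1],  # partly associated with the label
    "cough": [1, 1, 1, 1, 0, 0, 0, 0],  # perfectly associated with the label
}

if __name__ == "__main__":
    print(ensemble_rank(toy_features, toy_y))
```

Averaging ranks rather than raw scores avoids having to put the two criteria on a common scale. Other ensembling rules, such as intersecting the two top-k lists, would be equally plausible readings of the README.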
diff --git a/Predicting_COVID-19_Test_Result/code/dataplot.py b/Predicting_COVID-19_Test_Result/code/dataplot.py
@@ -0,0 +1,110 @@
+"""
+Generate plots for project data
+
+Project Started: April.25.2021
+Project Concluded: April.28.2021
+"""
+
+###Import Lib
+import pandas as pd
+import matplotlib.pyplot as plt
+import progressbar
+
+###Global Settings
+datadir = "E:/Personal Sync/Academic/Spring 2021/CS 506/Homework/project/covid-project/data/"
+outdir = "E:/Personal Sync/Academic/Spring 2021/CS 506/Homework/project/covid-project/output/"
+
+##Uncomment and comment to switch dataset
+#dfname = "raw_concatenated.csv"
+dfname = "israeli/corona_tested_individuals_ver_006.english.csv"
+
+#round to X decimal for float
+roundn = 6
+
+#whitelist columns that keep their original dtype (not converted to str)
+wlist = ["covid19_test_results", "swab_type", "age"]
+
+
+######Begin Functions
+###Load data from file
+def getdata(dataloc):
+    return pd.read_csv(dataloc)
+
+#print the number of rows that contain at least one NaN
+def getnan(rdata):
+    nan = rdata.isna().any(axis=1).sum()
+    print(nan)
+
+#write a specific file to a system location with content
+def writefile(filename, content):
+    lenf1 = len(content)
+    bar = progressbar.ProgressBar(maxval=lenf1,
+        widgets=[progressbar.Bar('=', '[', ']'), ' ', progressbar.Percentage()])
+    bar.start()
+    cnt = 0
+    #the with-block closes the file even if a write raises
+    with open(filename, "w") as res:
+        for i in content:
+            res.write(str(i))
+            res.write("\n")
+            cnt += 1
+            bar.update(cnt)
+    bar.finish()
+
+#get histo plot for specific column
+def get_histo(rdata, cname):
+    #whitelisted columns keep their dtype; any other column is plotted as strings
+    if cname in wlist:
+        testr = rdata[cname]
+    else:
+        testr = rdata[cname].astype(str)
+    plt.hist(testr)
+    plt.title(cname)
+    plt.ylabel('Number of Cases')
+    plt.savefig(outdir + cname + "_histogram.png")
+    plt.show()
+    return
+
+#get bar chart
+def get_bar(rdata, cname):
+    #plt.bar needs x positions and heights, so plot the count of each category
+    counts = rdata[cname].astype(str).value_counts()
+    plt.bar(counts.index, counts.values)
+    plt.title(cname)
+    plt.ylabel('Number of Cases')
+    plt.savefig(outdir + cname + "_barchart.png")
+    plt.show()
+    return    
+
+#get number of nan in a series
+def get_nan(rdata):
+    return rdata.isna().sum()
+
+#main function
+def main():
+    dloc = datadir + dfname
+    rawd = getdata(dloc)
+    #getnan(rawd)
+
+    #get histogram
+    for col in rawd.columns:
+        get_histo(rawd, col)
+
+    #count nan values
+    res = []
+    #get row length
+    rlen = len(rawd.index)
+    res.append("Number of NaN values and their percentage in each column")
+    for col in rawd.columns:
+        n = get_nan(rawd[col])
+        #get percentage
+        perc = (n/rlen)*100
+        r = col + " : " + str(n) + "   Percentage: " + str(round(perc, roundn)) + "%"
+        res.append(r)    
+    #write to file
+    writefile(outdir + "number_of_nan.txt", res)
+
+    return
+#invoke the main function
+if __name__ == "__main__":
+    main()
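Review note on `dataplot.py`: `datadir` and `outdir` are hard-coded absolute paths joined by string concatenation, so the script only runs on the author's machine. Below is a minimal sketch of one alternative, assuming the project keeps `data/` and `output/` folders under a common root. The function names `resolve_dirs` and `output_path` and the environment variable `COVID_PROJECT_ROOT` are hypothetical and not part of the project.

```python
import os
import tempfile
from pathlib import Path


def resolve_dirs(root=None):
    """Return (datadir, outdir) under a project root, creating outdir if needed."""
    # COVID_PROJECT_ROOT is a hypothetical environment variable used for illustration.
    base = Path(root or os.environ.get("COVID_PROJECT_ROOT", "."))
    datadir = base / "data"
    outdir = base / "output"
    outdir.mkdir(parents=True, exist_ok=True)
    return datadir, outdir


def output_path(outdir, cname, suffix):
    """Build an output file path such as <outdir>/<cname>_histogram.png."""
    return Path(outdir) / f"{cname}{suffix}"


if __name__ == "__main__":
    # Demonstrate against a throwaway directory so nothing is left behind.
    with tempfile.TemporaryDirectory() as tmp:
        datadir, outdir = resolve_dirs(tmp)
        print(output_path(outdir, "age", "_histogram.png"))
```

With helpers like these, `plt.savefig(output_path(outdir, cname, "_histogram.png"))` could replace `outdir + cname + "_histogram.png"`, and the root could be changed without editing the script.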
diff --git a/Predicting_COVID-19_Test_Result/code/dict_to_txt.py b/Predicting_COVID-19_Test_Result/code/dict_to_txt.py
@@ -0,0 +1,36 @@
+# -*- coding: utf-8 -*-
+"""
+Created on Sat Apr 24 00:09:42 2021
+
+@author: KevinLu
+"""
+from datetime import datetime
+
+d = {'x': 1, 'y': "test", 'z': 3} 
+
+
+wdir = "./output/dict_log.txt"
+
+def dict_to_txt(payload, title, wodir = wdir):
+    def add_txt_to_file(filename, content):
+        #append mode keeps earlier reports; the with-block closes the file
+        with open(filename, "a") as res:
+            for i in content:
+                res.write(str(i))
+                res.write("\n")
+    #generate payload list
+    res = []
+    res.append("##########" + title + "##########")
+    t1 = str(datetime.now())
+    res.append("Report Created: " + t1)
+    res.append("\n")
+
+    #loop through the dict
+    for key in payload:
+        stro = str(key) + " : " + str(payload[key])
+        res.append(stro)
+    res.append("#"*(len(title)+20))
+    #write file
+    add_txt_to_file(wodir, res)
+
+dict_to_txt(d, "this is a test")
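Review note on `dict_to_txt.py`: the module calls `dict_to_txt` at import time and assumes `./output/` already exists. Below is a minimal sketch of a variant that separates formatting from writing, so the formatting can be checked without touching the disk. The names `format_report` and `append_report` and the `created` parameter are hypothetical; the line layout mirrors the original function.

```python
from datetime import datetime
from pathlib import Path
import tempfile


def format_report(payload, title, created=None):
    """Return the report as a list of lines, in the same layout as dict_to_txt."""
    created = created or datetime.now()
    lines = [
        "##########" + title + "##########",
        "Report Created: " + str(created),
        "\n",  # kept from the original: yields two blank lines once written
    ]
    lines.extend(f"{key} : {value}" for key, value in payload.items())
    lines.append("#" * (len(title) + 20))
    return lines


def append_report(path, lines):
    """Append the lines to path, creating the parent directory if it is missing."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("a") as fh:
        for line in lines:
            fh.write(str(line) + "\n")


if __name__ == "__main__":
    # Write to a temporary directory rather than ./output for the demo.
    with tempfile.TemporaryDirectory() as tmp:
        target = Path(tmp) / "output" / "dict_log.txt"
        append_report(target, format_report({"x": 1, "y": "test"}, "this is a test"))
        print(target.read_text())
```

Keeping the demo call under the `__main__` guard means importing the module from another script no longer appends a test report to the log.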