- set up base project file structure and GitHub repo
- load sequences from the virus_genome folder into a pandas DataFrame and perform kmer processing (see the k-mer pipeline sketch below)
- normalize the kmer counts and merge them with the zoonotic labels
- store trained models in pickle files - currently in curr_models (see the model persistence sketch below)
- fix gradient boosting and XGBoost inconsistencies - make sure prediction data uses the same feature column order the models were trained on
- fix random forest issues
- fix mergedDf problems where sequences & classification labels don't line up properly
- visualize kmer patterns & feature importances with pyplot (see the plotting sketch below)
- similar sequence patterns
- retrieve blood virome data from GenBank for validation
- separate the model code into three files ("run" for initial data preprocessing and storage, "models" for evaluation, "validate" for testing model performance)
- cache processed data in "info.csv" for quick access & load times
- add initial synthetic data for testing
- optimize HTTP requests for blood virome accessions, e.g. by batching them (see the batched GenBank fetch sketch below)
- optimize synthetic data for better performance
- finish validating the model on blood virome sequences
- fix validation / continue training the GBM on the Nardus Mollentze paper dataset - we know that merging datasets works!!
- consider NOT normalizing data before loading it into info.csv
- https://medium.com/mlearning-ai/apply-machine-learning-algorithms-for-genomics-data-classification-132972933723#c401 <-- read this
- fix .gitignore not matching the contigs
- consider restructuring the models to work without normalization, maybe with different models for different base pairings?
- consider using different kmer lengths!! (the k-mer pipeline sketch below parameterizes k)
- [ ] fix the ensemble model predicting really wide-ranging probabilities (see the calibration sketch below)
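
A minimal sketch of the k-mer pipeline covered by the first few items, assuming the virus_genome folder holds FASTA files, Biopython is installed, and the zoonotic labels live in a hypothetical zoonotic.csv with accession and zoonotic columns; the k-mer length is a parameter so different lengths can be tried, and the merged table is cached to info.csv:

```python
from collections import Counter
from itertools import product
from pathlib import Path

import pandas as pd
from Bio import SeqIO  # Biopython

def kmer_frequencies(seq: str, k: int = 4) -> dict:
    """Count every k-mer in a sequence and normalize counts to frequencies."""
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values()) or 1
    # fixed column order over all A/C/G/T k-mers; k-mers containing N etc. are dropped
    return {"".join(km): counts.get("".join(km), 0) / total
            for km in product("ACGT", repeat=k)}

def build_kmer_table(genome_dir: str = "virus_genome", k: int = 4) -> pd.DataFrame:
    """One row per record in the FASTA files under genome_dir (extension assumed .fasta)."""
    rows = []
    for fasta in Path(genome_dir).glob("*.fasta"):
        for record in SeqIO.parse(str(fasta), "fasta"):
            row = {"accession": record.id}
            row.update(kmer_frequencies(str(record.seq), k))
            rows.append(row)
    return pd.DataFrame(rows)

if __name__ == "__main__":
    kmers = build_kmer_table("virus_genome", k=4)
    labels = pd.read_csv("zoonotic.csv")          # assumed columns: accession, zoonotic
    merged = kmers.merge(labels, on="accession")  # inner join keeps only labeled rows
    merged.to_csv("info.csv", index=False)        # cache for quick reloads
```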
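For the pickle storage and the GBM/XGBoost column-ordering fixes, one approach (a sketch, not the project's actual code) is to pickle the training-time column list alongside each model and reindex new data to that order before predicting; the curr_models path is from the notes above, the file naming is an assumption:

```python
import pickle
from pathlib import Path

import pandas as pd

MODEL_DIR = Path("curr_models")

def save_model(model, X_train: pd.DataFrame, name: str) -> None:
    """Pickle the model together with the exact feature column order it was trained on."""
    MODEL_DIR.mkdir(exist_ok=True)
    with open(MODEL_DIR / f"{name}.pkl", "wb") as fh:
        pickle.dump({"model": model, "columns": list(X_train.columns)}, fh)

def load_and_predict(name: str, X_new: pd.DataFrame):
    """Reload a pickled model and reorder new data to the training-time column layout."""
    with open(MODEL_DIR / f"{name}.pkl", "rb") as fh:
        bundle = pickle.load(fh)
    # missing columns are filled with 0 and extras are dropped, so the GBM/XGBoost
    # models always see features in the order they saw during training
    X_aligned = X_new.reindex(columns=bundle["columns"], fill_value=0)
    return bundle["model"].predict_proba(X_aligned)
```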
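For pulling blood virome sequences from GenBank with fewer HTTP round-trips, Biopython's Entrez.efetch accepts a comma-separated list of IDs, so accessions can be fetched in batches; the email address and accession IDs below are placeholders only:

```python
from Bio import Entrez, SeqIO

Entrez.email = "you@example.com"  # placeholder; NCBI requires a contact address

def fetch_blood_virome(accessions, batch_size: int = 100):
    """Fetch GenBank records in batches instead of one HTTP request per accession."""
    records = []
    for start in range(0, len(accessions), batch_size):
        batch = accessions[start:start + batch_size]
        handle = Entrez.efetch(db="nucleotide", id=",".join(batch),
                               rettype="fasta", retmode="text")
        records.extend(SeqIO.parse(handle, "fasta"))
        handle.close()
    return records

# usage with placeholder accession IDs:
# blood_seqs = fetch_blood_virome(["ACC_1", "ACC_2", "ACC_3"])
```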
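A small pyplot sketch for the feature-importance item, assuming a fitted tree-based model that exposes feature_importances_ (as scikit-learn's random forest and gradient boosting models do) and the k-mer column names as feature names:

```python
import matplotlib.pyplot as plt
import pandas as pd

def plot_top_kmers(model, feature_names, top_n: int = 20) -> None:
    """Horizontal bar chart of the most important k-mer features."""
    importances = pd.Series(model.feature_importances_, index=feature_names)
    top = importances.sort_values(ascending=False).head(top_n)
    fig, ax = plt.subplots(figsize=(8, 6))
    top.iloc[::-1].plot.barh(ax=ax)  # reversed so the biggest bar ends up on top
    ax.set_xlabel("feature importance")
    ax.set_ylabel("k-mer")
    ax.set_title(f"Top {top_n} k-mer features")
    fig.tight_layout()
    plt.show()
```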
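For the wide-ranging ensemble probabilities, one standard option (an assumption about the fix, not a decision the project has made) is probability calibration with scikit-learn's CalibratedClassifierCV:

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.ensemble import GradientBoostingClassifier

def calibrated_gbm(X, y):
    """Wrap the GBM in cross-validated isotonic calibration so predict_proba behaves."""
    clf = CalibratedClassifierCV(GradientBoostingClassifier(), method="isotonic", cv=5)
    clf.fit(X, y)
    return clf  # clf.predict_proba(X_new) returns the calibrated probabilities
```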