- set up base project file structure and GitHub repo
- load sequences from the virus_genome folder into a pandas DataFrame and perform kmer processing (see the k-mer pipeline sketch below)
- normalize the kmer counts and merge them with the zoonotic labels
- store trained models in pickle files - currently in curr_models (see the model persistence sketch below)
- fix gradient boosting and XGBoost inconsistencies - make sure prediction data uses the same feature column order the models were trained on
- fix random forest issues
- fix mergedDf problems where sequences & classification labels don't line up properly
- visualize kmer patterns & feature importances with pyplot (see the plotting sketch below)
- similar sequence patterns
- retrieve blood virome data from GenBank for validation
- separate the model code into three files ("run" for initial data preprocessing and storage, "models" for evaluation, "validate" for testing model performance)
- cache processed data in "info.csv" for quick access & load times
- add initial synthetic data for testing
- optimize HTTP requests for blood virome accessions, e.g. by batching them (see the batched GenBank fetch sketch below)
- optimize synthetic data for better performance
- finish validating the model on blood virome sequences
- fix validation / continue training the GBM on the Nardus Mollentze paper dataset - we know that merging datasets works!!
- consider NOT normalizing data before loading it into info.csv
- https://medium.com/mlearning-ai/apply-machine-learning-algorithms-for-genomics-data-classification-132972933723#c401 <-- read this
- fix .gitignore not matching the contigs
- consider restructuring the models to work without normalization, maybe with different models for different base pairings?
- consider using different kmer lengths!! (the k-mer pipeline sketch below parameterizes k)
- [ ] fix the ensemble model predicting really wide-ranging probabilities (see the calibration sketch below)
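
A minimal sketch of the k-mer pipeline covered by the first few items, assuming the virus_genome folder holds FASTA files, Biopython is installed, and the zoonotic labels live in a hypothetical zoonotic.csv with accession and zoonotic columns; the k-mer length is a parameter so different lengths can be tried, and the merged table is cached to info.csv:

```python
from collections import Counter
from itertools import product
from pathlib import Path

import pandas as pd
from Bio import SeqIO  # Biopython

def kmer_frequencies(seq: str, k: int = 4) -> dict:
    """Count every k-mer in a sequence and normalize counts to frequencies."""
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values()) or 1
    # fixed column order over all A/C/G/T k-mers; k-mers containing N etc. are dropped
    return {"".join(km): counts.get("".join(km), 0) / total
            for km in product("ACGT", repeat=k)}

def build_kmer_table(genome_dir: str = "virus_genome", k: int = 4) -> pd.DataFrame:
    """One row per record in the FASTA files under genome_dir (extension assumed .fasta)."""
    rows = []
    for fasta in Path(genome_dir).glob("*.fasta"):
        for record in SeqIO.parse(str(fasta), "fasta"):
            row = {"accession": record.id}
            row.update(kmer_frequencies(str(record.seq), k))
            rows.append(row)
    return pd.DataFrame(rows)

if __name__ == "__main__":
    kmers = build_kmer_table("virus_genome", k=4)
    labels = pd.read_csv("zoonotic.csv")          # assumed columns: accession, zoonotic
    merged = kmers.merge(labels, on="accession")  # inner join keeps only labeled rows
    merged.to_csv("info.csv", index=False)        # cache for quick reloads
```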
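For the pickle storage and the GBM/XGBoost column-ordering fixes, one approach (a sketch, not the project's actual code) is to pickle the training-time column list alongside each model and reindex new data to that order before predicting; the curr_models path is from the notes above, the file naming is an assumption:

```python
import pickle
from pathlib import Path

import pandas as pd

MODEL_DIR = Path("curr_models")

def save_model(model, X_train: pd.DataFrame, name: str) -> None:
    """Pickle the model together with the exact feature column order it was trained on."""
    MODEL_DIR.mkdir(exist_ok=True)
    with open(MODEL_DIR / f"{name}.pkl", "wb") as fh:
        pickle.dump({"model": model, "columns": list(X_train.columns)}, fh)

def load_and_predict(name: str, X_new: pd.DataFrame):
    """Reload a pickled model and reorder new data to the training-time column layout."""
    with open(MODEL_DIR / f"{name}.pkl", "rb") as fh:
        bundle = pickle.load(fh)
    # missing columns are filled with 0 and extras are dropped, so the GBM/XGBoost
    # models always see features in the order they saw during training
    X_aligned = X_new.reindex(columns=bundle["columns"], fill_value=0)
    return bundle["model"].predict_proba(X_aligned)
```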
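For pulling blood virome sequences from GenBank with fewer HTTP round-trips, Biopython's Entrez.efetch accepts a comma-separated list of IDs, so accessions can be fetched in batches; the email address and accession IDs below are placeholders only:

```python
from Bio import Entrez, SeqIO

Entrez.email = "you@example.com"  # placeholder; NCBI requires a contact address

def fetch_blood_virome(accessions, batch_size: int = 100):
    """Fetch GenBank records in batches instead of one HTTP request per accession."""
    records = []
    for start in range(0, len(accessions), batch_size):
        batch = accessions[start:start + batch_size]
        handle = Entrez.efetch(db="nucleotide", id=",".join(batch),
                               rettype="fasta", retmode="text")
        records.extend(SeqIO.parse(handle, "fasta"))
        handle.close()
    return records

# usage with placeholder accession IDs:
# blood_seqs = fetch_blood_virome(["ACC_1", "ACC_2", "ACC_3"])
```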
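A small pyplot sketch for the feature-importance item, assuming a fitted tree-based model that exposes feature_importances_ (as scikit-learn's random forest and gradient boosting models do) and the k-mer column names as feature names:

```python
import matplotlib.pyplot as plt
import pandas as pd

def plot_top_kmers(model, feature_names, top_n: int = 20) -> None:
    """Horizontal bar chart of the most important k-mer features."""
    importances = pd.Series(model.feature_importances_, index=feature_names)
    top = importances.sort_values(ascending=False).head(top_n)
    fig, ax = plt.subplots(figsize=(8, 6))
    top.iloc[::-1].plot.barh(ax=ax)  # reversed so the biggest bar ends up on top
    ax.set_xlabel("feature importance")
    ax.set_ylabel("k-mer")
    ax.set_title(f"Top {top_n} k-mer features")
    fig.tight_layout()
    plt.show()
```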
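For the wide-ranging ensemble probabilities, one standard option (an assumption about the fix, not a decision the project has made) is probability calibration with scikit-learn's CalibratedClassifierCV:

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.ensemble import GradientBoostingClassifier

def calibrated_gbm(X, y):
    """Wrap the GBM in cross-validated isotonic calibration so predict_proba behaves."""
    clf = CalibratedClassifierCV(GradientBoostingClassifier(), method="isotonic", cv=5)
    clf.fit(X, y)
    return clf  # clf.predict_proba(X_new) returns the calibrated probabilities
```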