TODO.md
  • set up base file project structure and github
  • load data from virus_genome folder into pandas dataframe and perform kmer processing
  • normalize kmer values and merge with zoonotic
  • store models in pickle files - currently in curr_models
  • fix gradient boosting and XGBoost inconsistencies - order the dataset the same way it was ordered during training
  • fix random forest issues
  • fix mergedDf problems with sequences & classification not lining up properly
  • visualize kmer patterns & feature importances with pyplot
  • similar sequence patterns
  • retrieve blood virome data from genbank for validation
  • separate model architecture into three different files ("run" for initial data preprocessing and storage, "models" for evaluation, "validate" for testing model performance)
  • load data into "info.csv" for quick access & load times
  • add initial synthetic data for testing
  • optimize HTTP requests for blood virome accessions
  • optimize synthetic data for better performance
  • finish validate model on blood sequences
  • fix validation/continue training on the Nardus Mollentze paper for GBM - we know that merging datasets works!!
  • consider NOT normalizing data before loading it into info.csv
  • https://medium.com/mlearning-ai/apply-machine-learning-algorithms-for-genomics-data-classification-132972933723#c401 <-- read this
  • fix .gitignore not detecting contigs
  • consider restructuring models to work without normalization, and using different models for different base pairings?
  • consider using different kmer lengths!!
  • [] fix ensemble model predicting very wide-ranging probabilities
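
Several items above (kmer processing, normalization, trying different kmer lengths, and keeping the dataset in training order) can be sketched together. This is a minimal sketch, not the project's actual code: the function names `kmer_frequencies` and `to_feature_row`, the default `k=4`, and the choice to skip kmers containing ambiguous bases are all assumptions made for illustration.

```python
from collections import Counter
from itertools import product


def kmer_frequencies(sequence, k=4):
    """Return normalized k-mer frequencies for one nucleotide sequence.

    Hypothetical helper: k=4 is an assumed default, not taken from the project.
    """
    sequence = sequence.upper()
    # Fixed vocabulary of all 4**k k-mers over A/C/G/T, in lexicographic order,
    # so every sequence produces the same set of features.
    vocabulary = ["".join(p) for p in product("ACGT", repeat=k)]
    # Count every overlapping window of length k.
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    # Assumption: windows with ambiguous bases (e.g. N) are ignored.
    total = sum(counts[kmer] for kmer in vocabulary)
    return {kmer: (counts[kmer] / total if total else 0.0) for kmer in vocabulary}


def to_feature_row(frequencies, training_columns):
    """Order features exactly as in the (hypothetical) training column list."""
    return [frequencies.get(column, 0.0) for column in training_columns]


if __name__ == "__main__":
    freqs = kmer_frequencies("ACGTAC", k=2)
    print(to_feature_row(freqs, ["TA", "AC"]))
```

Saving the training column list next to the pickled model and passing it to `to_feature_row` at prediction time is one way to keep feature order consistent between training and validation. Changing `k` changes the vocabulary size (4**k features), so a model trained with one kmer length cannot be reused with another.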