SayujGupta2005/TIR-Predictor-in-E.coli

TIR Predictor for E. coli (Work in Progress)

Overview

  • TIR Predictor is a machine learning project that predicts translation initiation rates (TIR) in E. coli from gene sequence data. It is still under development, with ongoing work on feature selection and model performance.

Dataset

  • The dataset (final_data.csv) is small: it contains only 331 distinct E. coli gene sequences.
  • Each sequence is described by 26 parameters. Four of them are gene_length, the Shine-Dalgarno score, and the positions of the A and C nucleotides (T and G are irrelevant); the remaining 22 are Gibbs free energies for various parts of the sequence.
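A minimal sketch of how a table like this could be split into features and a target, using only the Python standard library. The target column name `TIR`, the helper name `load_feature_table`, and the assumption that every column is numeric are illustrative guesses, not taken from final_data.csv; adjust them to the real header.

```python
import csv
import io


def load_feature_table(csv_text, target_column="TIR"):
    """Split CSV text into feature names, feature rows, and target values.

    Assumes every column is numeric and that `target_column` (a hypothetical
    name) holds the measured translation initiation rate.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    # Every column except the target is treated as a feature.
    feature_names = [name for name in reader.fieldnames if name != target_column]
    features, targets = [], []
    for row in reader:
        features.append([float(row[name]) for name in feature_names])
        targets.append(float(row[target_column]))
    return feature_names, features, targets


# Tiny made-up sample with two of the feature columns, for illustration only.
sample = "gene_length,sd_score,TIR\n900,4.5,0.32\n1200,3.1,0.18\n"
names, X, y = load_feature_table(sample)
print(names, len(X), len(y))
```

On the real file, one would expect `len(X)` to be 331 and `len(names)` to be 26, which makes a quick sanity check before any modelling.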

Current Approach

  • Dimensionality Reduction: We applied Principal Component Analysis (PCA) and kept 12 components, which gave the best performance.
  • Model Used: After thorough testing, XGBoost gave the best prediction results.
  • Performance:
    • Train Pearson Score: 78%
    • Test Pearson Score: 75%
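For reference, here is a minimal sketch of how a Pearson score can be computed. It assumes the score reported above is the Pearson correlation coefficient between predicted and measured TIR, expressed as a percentage (so 78% corresponds to r = 0.78). The function name `pearson_score` is hypothetical and this is not the project's own evaluation code.

```python
import math


def pearson_score(predicted, measured):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(predicted) != len(measured) or len(predicted) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(predicted)
    mean_p = sum(predicted) / n
    mean_m = sum(measured) / n
    # Sum of products of deviations (unnormalised covariance).
    cov = sum((p - mean_p) * (m - mean_m) for p, m in zip(predicted, measured))
    # Sums of squared deviations (unnormalised variances).
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    var_m = sum((m - mean_m) ** 2 for m in measured)
    if var_p == 0 or var_m == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_p * var_m)


# Made-up predictions and measurements, for illustration only.
r = pearson_score([1.0, 2.0, 3.0, 4.0], [1.0, 3.0, 2.0, 4.0])
print(f"Pearson score: {r:.0%}")
```

Computing the score separately on the training and test splits, as the figures above do, shows how much of the fit carries over to unseen sequences.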

Next Steps

  • Exploring feature engineering techniques to enhance prediction accuracy.
  • Calculating sd_score with a better method that improves the Pearson score.
  • Trying different models and hyperparameter tuning.
  • Increasing dataset size for better generalization.

Acknowledgment

Stay tuned for further updates!
