TIR Predictor for E. coli (Work in Progress)

Overview

TIR Predictor is a machine learning-based project aimed at predicting translational initiation rates (TIR) using gene sequence data. The project is currently under development, with ongoing improvements in feature selection and model performance.

The dataset (final_data.csv) is small, containing only 331 different E. coli gene sequences.
Each sequence is described by 26 parameters. Four of them are gene_length, the Shine-Dalgarno score, and the positions of the A and C nucleotides (T and G are irrelevant); the rest are Gibbs free energies for various parts of the sequence.

Dimensionality Reduction: We applied Principal Component Analysis (PCA) and kept 12 components, which gave the best performance.
Model Used: After thorough testing, XGBoost gave the best prediction results.
Performance:
- Train Pearson Score: 78%
- Test Pearson Score: 75%
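
The steps above (PCA down to 12 components, an XGBoost regressor, Pearson correlation as the metric) can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the data is synthetic with the same shape as the dataset (331 rows, 26 features), and the standardization step, the 80/20 split, and all hyperparameter values are assumptions. If xgboost is not installed, the sketch falls back to scikit-learn's gradient boosting, which is a different model.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

try:
    from xgboost import XGBRegressor as Regressor
except ImportError:
    # Fallback so the sketch still runs without xgboost (not the same model).
    from sklearn.ensemble import GradientBoostingRegressor as Regressor

# Synthetic stand-in for final_data.csv: 331 sequences, 26 features, one target.
rng = np.random.default_rng(0)
X = rng.normal(size=(331, 26))
y = X[:, :5].sum(axis=1) + rng.normal(scale=0.5, size=331)

# Assumed 80/20 split; the split used in the notebook is not stated.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Fit the scaler and PCA on the training split only, then reuse them on test.
scaler = StandardScaler().fit(X_train)
pca = PCA(n_components=12).fit(scaler.transform(X_train))
Z_train = pca.transform(scaler.transform(X_train))
Z_test = pca.transform(scaler.transform(X_test))

# Hyperparameters are illustrative assumptions, not tuned values.
model = Regressor(n_estimators=200, max_depth=3, learning_rate=0.05, random_state=0)
model.fit(Z_train, y_train)


def pearson(a, b):
    """Pearson correlation coefficient between two 1-D arrays."""
    return float(np.corrcoef(a, b)[0, 1])


train_r = pearson(y_train, model.predict(Z_train))
test_r = pearson(y_test, model.predict(Z_test))
print(f"Train Pearson: {train_r:.2f}  Test Pearson: {test_r:.2f}")
```

Fitting the scaler and PCA on the training split only keeps information from the test sequences out of the model, so the test score is not inflated.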

Exploring feature engineering techniques to enhance prediction accuracy.
Calculating sd_score in a better way that has a positive effect on the Pearson score.
Trying different models and hyperparameter tuning.
Increasing the dataset size for better generalization.
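
For the tuning item above, one possible approach on a dataset this small is to compare settings with k-fold cross-validation rather than a single train/test split, since a single split of 331 rows can give a noisy score. The sketch below is only a suggestion and is not taken from the notebook: the helper name cv_pearson, the candidate component counts, the fold count, and the model settings are all hypothetical, and the data is again synthetic.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.model_selection import KFold

try:
    from xgboost import XGBRegressor as Regressor
except ImportError:
    # Fallback so the sketch still runs without xgboost (not the same model).
    from sklearn.ensemble import GradientBoostingRegressor as Regressor

# Synthetic stand-in with the dataset's shape: 331 rows, 26 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(331, 26))
y = X[:, :5].sum(axis=1) + rng.normal(scale=0.5, size=331)


def cv_pearson(X, y, n_components, n_splits=5, seed=0):
    """Mean validation Pearson correlation over k folds (hypothetical helper)."""
    folds = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    scores = []
    for train_idx, val_idx in folds.split(X):
        # PCA is refit inside each fold so validation rows never influence it.
        pca = PCA(n_components=n_components).fit(X[train_idx])
        model = Regressor(
            n_estimators=100, max_depth=3, learning_rate=0.1, random_state=seed
        )
        model.fit(pca.transform(X[train_idx]), y[train_idx])
        pred = model.predict(pca.transform(X[val_idx]))
        scores.append(np.corrcoef(y[val_idx], pred)[0, 1])
    return float(np.mean(scores))


# Compare a few candidate PCA sizes; the candidates are illustrative only.
results = {k: cv_pearson(X, y, k) for k in (8, 12, 16)}
best_k = max(results, key=results.get)
for k, score in results.items():
    print(f"n_components={k}: mean CV Pearson = {score:.3f}")
print("Best n_components:", best_k)
```

The same loop could be extended to other hyperparameters or other models by adding them as arguments to the helper.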

Stay tuned for further updates!
