A machine learning pipeline that classifies protein sequences associated with Parkinson's disease (PD) from sequence-derived features, comparing several classification models. The best model, an LSTM, reaches 80.3% ± 4.6% accuracy under 5-fold stratified cross-validation.
- Python 3.11+
- CUDA capable GPU (optional, for faster training)
git clone https://github.com/Spritan/ParkinsonDiseaseClassifier
cd ParkinsonDiseaseClassifier
pip install uv
uv run main.py
dependencies = [
    "biopython>=1.84",
    "matplotlib>=3.9.2",
    "pandas>=2.2.3",
    "rich>=13.9.4",
    "scikit-learn>=1.5.2",
    "seaborn>=0.13.2",
    "torch>=2.5.1",
    "xgboost>=2.1.2",
]
FASTA files contain protein sequences in a text-based format:
>sp|Q9Y6M9|NDUB9_HUMAN NADH dehydrogenase [ubiquinone] 1 beta subcomplex subunit 9
MAFLASGPYLTHQQKVLRLYKRALRHLESWCVQRDKYRYFACLMRARFEEHKNEKDMAKA
TQLLKEAEEEFWYRQHPQPYIFPDSPGGTSYERYDCYKVPEWCLDDWHPSEKAMYPDYFA
KREQWKKLRRESWEREVKQLQEETPPGGPLTEALPPARKEGDLPPLWWYIVTRPRERPM
- Header line (starts with '>'): Contains sequence identifier and metadata
- Subsequent lines: Amino acid sequence in single-letter code
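
The two-part layout described above (one header line, then wrapped sequence lines) can be read with a few lines of code. The following is a minimal standard-library sketch for illustration only; the project itself parses FASTA with BioPython, and `parse_fasta` is a hypothetical helper, not a function from this repository.

```python
from typing import Iterator, Tuple


def parse_fasta(text: str) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted text.

    Illustrative helper (not part of the project): the header is returned
    without the leading '>', and wrapped sequence lines are joined.
    """
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


example = """>sp|Q9Y6M9|NDUB9_HUMAN NADH dehydrogenase [ubiquinone] 1 beta subcomplex subunit 9
MAFLASGPYLTHQQKVLRLYKRALRHLESWCVQRDKYRYFACLMRARFEEHKNEKDMAKA
TQLLKEAEEEFWYRQHPQPYIFPDSPGGTSYERYDCYKVPEWCLDDWHPSEKAMYPDYFA
KREQWKKLRRESWEREVKQLQEETPPGGPLTEALPPARKEGDLPPLWWYIVTRPRERPM
"""

records = list(parse_fasta(example))
header, sequence = records[0]
# UniProt-style headers are pipe-delimited: database | accession | entry name
accession = header.split("|")[1]
print(accession, len(sequence))
```

Splitting the header on `|` relies on the UniProt-style header shown in the example; headers from other sources may not follow that layout.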
- Control Sequences (129 samples)
  - Source: data/cleaned_output (1).fasta
  - Average length: 482 amino acids
  - Contains various human proteins without PD association
- Parkinson's Disease Sequences (165 samples)
  - Source: data/uniprotkb_parkinson_disease_protein_AND_2024_11_15.fasta
  - Average length: 523 amino acids
  - Proteins associated with PD pathology
- Length Distribution: Right-skewed, ranging from 100 to 2000+ amino acids
- GC Content: Normal distribution centered around 0.52
- Hydrophobic Ratio: Peaks at 0.40, indicating typical protein composition
- Charged Ratio: Centered at 0.25, showing consistent charge distribution
- Most Abundant:
  - Leucine (L): ~10%
  - Alanine (A): ~7.8%
  - Serine (S): ~7.2%
- Least Abundant:
  - Tryptophan (W): ~1.4%
  - Cysteine (C): ~2.2%
  - Histidine (H): ~2.4%
- Class Differences:
  - Slight variations in Serine (S) content between healthy and PD
  - Minor differences in charged amino acids (R, K, D, E)
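
Composition figures like the ones above can be reproduced by pooling residue counts per class and comparing the resulting fractions. This is a rough sketch under the assumption that frequencies are pooled over all residues in a class (the project may instead average per-sequence frequencies); the function names and toy sequences are invented for illustration.

```python
from collections import Counter


def amino_acid_composition(sequences):
    """Return {residue: fraction} pooled over all residues in `sequences`."""
    counts = Counter()
    for seq in sequences:
        counts.update(seq.upper())
    total = sum(counts.values())
    if total == 0:
        return {}
    return {aa: n / total for aa, n in counts.items()}


def composition_difference(group_a, group_b):
    """Per-residue fraction in group_a minus the fraction in group_b."""
    comp_a = amino_acid_composition(group_a)
    comp_b = amino_acid_composition(group_b)
    residues = sorted(set(comp_a) | set(comp_b))
    return {aa: comp_a.get(aa, 0.0) - comp_b.get(aa, 0.0) for aa in residues}


# Toy sequences, invented for illustration (not taken from the dataset)
toy_control = ["MALLS", "ALKE"]
toy_pd = ["MSSLE", "SLKR"]

diff = composition_difference(toy_pd, toy_control)
for residue, delta in diff.items():
    print(f"{residue}: {delta:+.3f}")
```

A positive value means the residue is more frequent in the first group; on real data a statistical test would be needed before calling a difference meaningful.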
Bi-mer Analysis:
Top 5 Most Frequent:
1. LE (7.2%): Leucine-Glutamic acid
2. AL (6.8%): Alanine-Leucine
3. LA (6.5%): Leucine-Alanine
4. EL (5.9%): Glutamic acid-Leucine
5. SE (5.7%): Serine-Glutamic acid
Tri-mer Analysis:
Top 5 Most Frequent:
1. LEE (3.1%): Leucine-Glutamic acid-Glutamic acid
2. ALA (2.9%): Alanine-Leucine-Alanine
3. LAA (2.8%): Leucine-Alanine-Alanine
4. ELL (2.7%): Glutamic acid-Leucine-Leucine
5. AAL (2.6%): Alanine-Alanine-Leucine
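
K-mer rankings like the bi-mer and tri-mer lists above come from counting overlapping windows of length k. The sketch below shows the idea for a single sequence using only the standard library; `kmer_frequencies` and `top_kmers` are hypothetical names, and how the project aggregates counts across sequences is not stated here.

```python
from collections import Counter


def kmer_frequencies(sequence, k):
    """Relative frequency of each overlapping k-mer in one sequence."""
    n_windows = len(sequence) - k + 1
    if n_windows <= 0:
        return {}  # sequence shorter than k
    counts = Counter(sequence[i:i + k] for i in range(n_windows))
    return {kmer: n / n_windows for kmer, n in counts.items()}


def top_kmers(sequence, k, n=5):
    """The n most frequent k-mers, ties broken alphabetically."""
    freqs = kmer_frequencies(sequence, k)
    return sorted(freqs.items(), key=lambda item: (-item[1], item[0]))[:n]


toy = "LEALEELA"  # invented toy sequence
print(top_kmers(toy, k=2))
print(top_kmers(toy, k=3))
```

Windows overlap, so a sequence of length n contributes n - k + 1 k-mers.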
- Strong correlations between related k-mers
- Moderate correlations between physicochemical properties
- Weak correlations between different feature types
- No missing values
- Valid amino acid sequences (20 standard amino acids)
- Roughly balanced class distribution (43.88% healthy, 56.12% PD)
- Consistent sequence format and annotation
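
Two of the checks above are easy to express in code: restricting sequences to the 20 standard amino acids, and deriving the class percentages from the sample counts (129 control, 165 PD). This is an illustrative sketch; `is_valid_protein` and `class_distribution` are assumed names, not functions from the repository.

```python
# The 20 standard amino acids in single-letter code
STANDARD_AMINO_ACIDS = frozenset("ACDEFGHIKLMNPQRSTVWY")


def is_valid_protein(sequence):
    """True if the sequence is non-empty and uses only standard residues."""
    return bool(sequence) and set(sequence.upper()) <= STANDARD_AMINO_ACIDS


def class_distribution(labels):
    """Return {label: percentage of samples}, rounded to two decimals."""
    total = len(labels)
    return {
        label: round(100 * labels.count(label) / total, 2)
        for label in sorted(set(labels))
    }


# Sample counts taken from the dataset description above
labels = ["healthy"] * 129 + ["pd"] * 165
distribution = class_distribution(labels)
print(distribution)
print(is_valid_protein("MAFLASGPYL"), is_valid_protein("MAFXB"))
```

Ambiguity codes such as X or B fail this check by design; whether the project drops or repairs such sequences is not specified.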
- Sequence validation
- Length normalization
- Feature extraction
- Feature scaling
- Train-test split (80-20)
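
The order of the last two steps matters: scaling statistics should be estimated on the training split only and then reused on the test split, otherwise test information leaks into training. The sketch below illustrates z-score scaling with the standard library; it is an assumption that the project uses this kind of scaling, and the helper names and toy rows are invented.

```python
from statistics import mean, pstdev


def fit_standardizer(train_rows):
    """Estimate per-column mean and population std from training rows only."""
    columns = list(zip(*train_rows))
    means = [mean(col) for col in columns]
    # A constant column has std 0; fall back to 1.0 to avoid division by zero
    stds = [pstdev(col) or 1.0 for col in columns]
    return means, stds


def standardize(rows, means, stds):
    """Apply (value - mean) / std column-wise using pre-fitted statistics."""
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]


# Toy rows of (length, hydrophobic_ratio), invented for illustration
train = [[100.0, 0.40], [300.0, 0.50], [500.0, 0.30]]
test = [[700.0, 0.40]]

means, stds = fit_standardizer(train)
train_scaled = standardize(train, means, stds)
test_scaled = standardize(test, means, stds)  # reuses the training statistics
print(test_scaled)
```

scikit-learn's `StandardScaler` follows the same fit-on-train, transform-on-test pattern and also uses the population standard deviation.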
def load_fasta_data(healthy_path, parkinsons_path):
    """
    Load and process FASTA sequences
    Returns DataFrame with sequences and labels
    """
- FASTA file parsing using BioPython
- Sequence validation and preprocessing
- Data frame construction with sequence metadata
Features extracted include:
- K-mer frequencies (k=2,3)
- Basic sequence properties:
  - Length
  - GC content
  - Hydrophobic ratio
  - Charged amino acid ratio
- Total features extracted: 7,950
- Features selected for modeling: 50
Multiple models evaluated:
- Deep Learning:
  - LSTM
  - Neural Network
- Traditional ML:
  - SVM
  - Random Forest
  - Gradient Boosting
  - XGBoost
  - KNN
The feature extraction process is handled by the SequenceFeatureExtractor class (reference: src/feature_extraction.py, lines 9-48), which combines basic sequence properties with k-mer counts:
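from sklearn.feature_extraction.text import CountVectorizer

class SequenceFeatureExtractor:
    def __init__(self, k=3):
        self.k = k
        # Character-level counts; ngram_range=(k, k) keeps only n-grams of length k
        self.vectorizer = CountVectorizer(analyzer='char', ngram_range=(k, k))

    def compute_basic_features(self, sequence):
        n = len(sequence)
        if n == 0:
            # Guard against division by zero on empty sequences
            raise ValueError("sequence must not be empty")
        return {
            'length': n,
            # Fraction of Gly (G) and Cys (C) residues
            'gc_content': (sequence.count('G') + sequence.count('C')) / n,
            'hydrophobic_ratio': sum(aa in 'AILMFWYV' for aa in sequence) / n,
            'charged_ratio': sum(aa in 'DEKR' for aa in sequence) / n,
        }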
- Length: Raw sequence length
- GC Content: Proportion of Glycine (G) and Cysteine (C) residues (in a protein sequence, C denotes Cysteine, not the nucleotide Cytosine)
- Hydrophobic Ratio: Proportion of hydrophobic amino acids (A, I, L, M, F, W, Y, V)
- Charged Ratio: Proportion of charged amino acids (D, E, K, R)
The system uses CountVectorizer to generate:
- Bi-mers (k=2): Captures local sequence patterns
- Tri-mers (k=3): Captures extended sequence motifs
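
To make the k-mer representation concrete, the sketch below builds a vocabulary of observed bi-mers and tri-mers and turns each sequence into a fixed-length count vector, which is roughly what `CountVectorizer(analyzer='char', ngram_range=(k, k))` produces. It is a standard-library illustration with hypothetical helper names, not the project's code. One real difference to keep in mind: `CountVectorizer` lowercases input by default, so its vocabulary holds lowercase k-mers. With 20 amino acids there are at most 20² + 20³ = 8,400 distinct bi-mers and tri-mers; a vocabulary built only from observed k-mers can be smaller.

```python
def build_vocabulary(sequences, ks=(2, 3)):
    """Map every k-mer observed in the corpus to a column index."""
    kmers = set()
    for seq in sequences:
        for k in ks:
            kmers.update(seq[i:i + k] for i in range(len(seq) - k + 1))
    return {kmer: col for col, kmer in enumerate(sorted(kmers))}


def count_matrix(sequences, vocabulary, ks=(2, 3)):
    """One row per sequence, one column per vocabulary k-mer."""
    rows = []
    for seq in sequences:
        row = [0] * len(vocabulary)
        for k in ks:
            for i in range(len(seq) - k + 1):
                col = vocabulary.get(seq[i:i + k])
                if col is not None:  # ignore k-mers unseen when the vocabulary was built
                    row[col] += 1
        rows.append(row)
    return rows


corpus = ["LEEL", "ALAL"]  # invented toy sequences
vocab = build_vocabulary(corpus)
matrix = count_matrix(corpus, vocab)
print(sorted(vocab))
print(matrix)
```

As with the scaler, the vocabulary should be built from training sequences only and then reused unchanged for held-out data.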
Feature selection is implemented in SequenceDataAnalyzer (reference: src/data_analysis.py, lines 81-97) using a multi-method approach:
def select_features(self, n_features=50):
    """
    Select top features using ensemble of methods:
    1. Random Forest importance
    2. ANOVA F-scores
    3. Mutual Information
    """
- Random Forest Importance
  - Evaluates feature importance through decision trees
  - Particularly effective for capturing non-linear relationships
  - Robust to feature scaling
- ANOVA F-scores
  - Statistical test for feature relevance
  - Identifies features with significant class separation
  - Assumes normal distribution
- Mutual Information
  - Information theory-based approach
  - Captures non-linear relationships
  - Scale-invariant feature selection
- Initial Feature Space
  - Total features extracted: 7,950
  - Includes all k-mers and basic properties
  - Sparse matrix representation for efficiency
- Feature Ranking
  - Each method ranks features independently
  - Scores are normalized to [0,1] range
  - Combined ranking through weighted voting
- Feature Reduction
  - Final selection: Top 50 features
  - Balanced representation across feature types
  - Optimized for model performance
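
The ranking step described above (independent scores, normalization to [0,1], weighted voting) can be sketched as follows. This is one plausible reading, not the repository's implementation: it assumes min-max normalization and a weighted average of normalized scores, and the helper names, weights, and score values are all invented for illustration.

```python
def min_max_normalize(scores):
    """Rescale a {feature: score} mapping to the [0, 1] range."""
    low, high = min(scores.values()), max(scores.values())
    span = high - low
    if span == 0:
        return {name: 0.0 for name in scores}  # all scores equal
    return {name: (value - low) / span for name, value in scores.items()}


def combine_rankings(score_sets, weights, n_features):
    """Weighted average of normalized scores; return the top n feature names."""
    total_weight = sum(weights.values())
    combined = {}
    for method, scores in score_sets.items():
        for name, value in min_max_normalize(scores).items():
            share = weights[method] * value / total_weight
            combined[name] = combined.get(name, 0.0) + share
    ranked = sorted(combined, key=lambda name: (-combined[name], name))
    return ranked[:n_features]


# Made-up scores on different scales, one dict per selection method
score_sets = {
    "random_forest": {"hydrophobic_ratio": 0.30, "length": 0.20, "LE": 0.05},
    "anova_f": {"hydrophobic_ratio": 4.0, "length": 1.0, "LE": 9.0},
    "mutual_info": {"hydrophobic_ratio": 0.02, "length": 0.10, "LE": 0.06},
}
weights = {"random_forest": 1.0, "anova_f": 1.0, "mutual_info": 1.0}  # assumed equal

selected = combine_rankings(score_sets, weights, n_features=2)
print(selected)
```

Normalizing first keeps a method with large raw scores (here the F-scores) from dominating the vote purely because of its scale.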
The system provides comprehensive feature importance visualization through multiple plots:
- Correlation Analysis (source lines 103-127)
- Feature Importance Plotting (source lines 128-159)
Feature selection significantly improves model performance:
- Computational Efficiency
  - 99.4% reduction in feature space
  - Faster training times
  - Reduced memory requirements
- Model Performance
  - Improved generalization
  - Reduced overfitting
  - More interpretable models
- Cross-validation Results
  - Selected features maintain 80.3% accuracy
  - Stable performance across folds
  - Robust to different model architectures
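
The 99.4% figure follows directly from the two feature counts reported in this document (7,950 extracted, 50 selected). A quick check, using hypothetical variable names:

```python
total_features = 7950    # features extracted, as reported above
selected_features = 50   # features kept after selection

# Fraction of the original feature space that was removed
reduction = 1 - selected_features / total_features
reduction_percent = round(100 * reduction, 1)
print(f"{reduction_percent}% reduction")
```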
- Advanced Feature Engineering
  - Position-specific scoring matrices
  - Secondary structure predictions
  - Evolutionary conservation scores
  - Domain-specific features
- Selection Methods
  - Deep learning-based feature selection
  - Ensemble selection strategies
  - Dynamic feature importance tracking
- Optimization
  - Feature selection hyperparameter tuning
  - Custom feature importance metrics
  - Real-time feature importance updates
┏━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━┓
┃ Metric ┃ Score (mean ± std) ┃
┡━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━┩
│ Accuracy │ 0.803 ± 0.046 │
│ Precision │ 0.816 ± 0.078 │
│ Recall │ 0.845 ± 0.095 │
│ F1 │ 0.825 ± 0.054 │
└───────────┴────────────────────┘
- Best overall performance
- Balanced precision and recall
- Consistent performance across classes
┏━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━┓
┃ Metric ┃ Score (mean ± std) ┃
┡━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━┩
│ Accuracy │ 0.718 ± 0.028 │
│ Precision │ 0.718 ± 0.065 │
│ Recall │ 0.830 ± 0.094 │
│ F1 │ 0.764 ± 0.042 │
└───────────┴────────────────────┘
- Second-best performer
- High recall but lower precision
- More stable metrics (lower std)
Model               F1-Score         Accuracy
---------------------------------------------------
Gradient Boosting   0.714 ± 0.048    0.656 ± 0.040
Random Forest       0.690 ± 0.043    0.639 ± 0.036
XGBoost             0.686 ± 0.040    0.636 ± 0.044
KNN                 0.683 ± 0.034    0.595 ± 0.039
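
The "mean ± std" scores in the tables above summarize per-fold metrics. The sketch below shows how accuracy, precision, recall, and F1 follow from one fold's confusion counts and how folds are then aggregated. The confusion counts are invented, the function names are hypothetical, and the sample standard deviation is an assumption: the source does not say whether sample or population std is reported.

```python
from statistics import mean, stdev


def classification_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall and F1 from one fold's confusion counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / denom if denom else 0.0,
    }


def summarize(fold_metrics):
    """Return {metric: (mean, sample std)} across folds."""
    summary = {}
    for name in fold_metrics[0]:
        values = [fold[name] for fold in fold_metrics]
        summary[name] = (mean(values), stdev(values))
    return summary


# Invented (tp, fp, fn, tn) counts for three folds, for illustration only
fold_counts = [(8, 2, 2, 8), (9, 1, 3, 7), (7, 3, 1, 9)]
fold_metrics = [classification_metrics(*counts) for counts in fold_counts]
summary = summarize(fold_metrics)

for name, (avg, spread) in summary.items():
    print(f"{name:<10} {avg:.3f} ± {spread:.3f}")
```

With PD as the positive class, high recall with lower precision means the model finds most PD proteins but also flags more controls as PD.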
- Leucine (L): ~10% frequency in both classes
- Key differences in Serine (S) and Alanine (A)
- Rare amino acids: Cysteine (C), Tryptophan (W)
Most significant patterns:
Bi-mers: LE (7.2%), AL (6.8%), LA (6.5%)
Tri-mers: LEE (3.1%), ALA (2.9%), LAA (2.8%)
Top features by method:
- Random Forest:
  - Hydrophobic ratio
  - Length
  - GC content
- ANOVA F-score:
  - K-mer patterns
  - Charged ratio
- Mutual Information:
  - Sequence length
  - K-mer frequencies
- LSTM consistently outperforms other models
- Deep learning models show superior performance
- Traditional ML models in the comparison table reach roughly 60-66% accuracy (F1 0.68-0.71)
- Data Processing Pipeline:
def process_sequences(sequences):
    """
    Process raw sequences:
    1. Validate amino acid content
    2. Extract features
    3. Normalize values
    """
- Feature Extraction:
class SequenceFeatureExtractor:
    """
    Extract features from protein sequences:
    - K-mer frequencies
    - Sequence properties
    - Physicochemical properties
    """
- Model Training:
class SequenceClassifier:
    """
    Train and evaluate multiple models:
    - Deep learning (LSTM, NN)
    - Traditional ML
    """
- Feature Selection Process:
  - Initial features: 7,950
  - Selected features: 50
  - Selection methods:
    - Random Forest importance
    - ANOVA F-scores
    - Mutual Information
- Cross-validation Strategy:
  - 5-fold stratified CV
  - Consistent random seed
  - Performance metrics with std
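
Stratified k-fold cross-validation keeps the class ratio roughly constant in every fold, and a fixed seed makes the folds reproducible. The following standard-library sketch shows the mechanics on the dataset's class counts (129 control, 165 PD); the function name and seed value are assumptions, and the project most likely relies on scikit-learn's `StratifiedKFold` rather than code like this.

```python
import random
from collections import defaultdict


def stratified_kfold(labels, n_splits=5, seed=42):
    """Yield (train_indices, val_indices) with class ratios preserved per fold."""
    rng = random.Random(seed)  # fixed seed gives reproducible folds
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    folds = [[] for _ in range(n_splits)]
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # Deal each class's samples round-robin across the folds
        for position, index in enumerate(indices):
            folds[position % n_splits].append(index)

    for held_out in range(n_splits):
        val = sorted(folds[held_out])
        train = sorted(
            i for f, fold in enumerate(folds) if f != held_out for i in fold
        )
        yield train, val


labels = [0] * 129 + [1] * 165  # class counts from the dataset description
splits = list(stratified_kfold(labels))
for fold_number, (train, val) in enumerate(splits, start=1):
    n_pd = sum(labels[i] for i in val)
    print(f"fold {fold_number}: train={len(train)} val={len(val)} pd_in_val={n_pd}")
```

Each sample lands in exactly one validation fold, so every sequence is used for evaluation once.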
- Target: 1000+ sequences per class
- Include:
  - More control sequences
  - Various PD subtypes
  - Related neurodegenerative diseases
- Position-specific scoring matrices
- Secondary structure predictions
- Domain-specific features
- Evolutionary conservation scores
- Attention mechanisms
- Transformer architectures
- Ensemble strategies
- Model interpretability
- Model persistence
- Parallel processing
- Configuration management
- API development
- UniProt database for protein sequences
- BioPython community
- Rich library developers
- PyTorch team