Classifying the Stoichiometry of Virus-like Particles with Interpretable Machine Learning

Introduction

This repository contains the scikit-learn implementation of the StoicIML pipeline, as shown below. StoicIML is an an interpretable, data-driven pipeline that leverages linear machine learning models to classify protein stoichiometry in virus-like particle assembly.

Pipeline

Datasets

The datasets/curatedfolder contains 200 protein sequences that assemble into either 60-mer or 180-mer VLPs, sourced from the RCSB PDB [1].

Requirements

Please install (a) the packages listed in requirements.txt and (b) the feature selection repository [2].

pip install -r requirements.txt

Reproduce Results

The basic configurations are inconfigs/congigs.py, main experiment configs in configs/main_exp/VLP_200.yaml, and ablation study configs in configs/study1_truncate/*.yaml and configs/study2_position_selection/*.yaml.

For the main experiments, you can directly run the following command.

chmod +x ./shell_scripts/main_experiments.sh
./shell_scripts/main_experiments.sh

For the ablation study, you can run

chmod +x ./shell_scripts/ablation_study.sh
./shell_scripts/ablation_study.sh

Citation

Please cite our paper if you find it useful.

@misc{zhang2025classifyingstoichiometryviruslikeparticles,
      title={Classifying the Stoichiometry of Virus-like Particles with Interpretable Machine Learning}, 
      author={Jiayang Zhang and Xianyuan Liu and Wei Wu and Sina Tabakhi and Wenrui Fan and Shuo Zhou and Kang Lan Tee and Tuck Seng Wong and Haiping Lu},
      year={2025},
      eprint={2502.12049},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2502.12049}, 
}

References

[1] Berman HM, Westbrook J, Feng Z, Gilliland G, Bhat TN, Weissig H, Shindyalov IN, Bourne PE. The protein data bank. Nucleic acids research. 2000 Jan 1;28(1):235-42.
[2] Li J, Cheng K, Wang S, Morstatter F, Trevino RP, Tang J, Liu H. Feature selection: A data perspective. ACM computing surveys (CSUR). 2017 Dec 6;50(6):1-45.

Name		Name	Last commit message	Last commit date
Latest commit History 16 Commits
analysis		analysis
configs		configs
datasets/curated		datasets/curated
image		image
main-exp-result/trial-result-VLP200-ridge-onehotprotein-seqlen1426-optimfold9		main-exp-result/trial-result-VLP200-ridge-onehotprotein-seqlen1426-optimfold9
pipeline		pipeline
shell_scripts		shell_scripts
LICENSE		LICENSE
README.md		README.md
main_linear-intlabel-cluster.py		main_linear-intlabel-cluster.py
main_linear-intlabel-protein.py		main_linear-intlabel-protein.py
main_linear-onehot-cluster.py		main_linear-onehot-cluster.py
main_linear-onehot-protein.py		main_linear-onehot-protein.py
main_linear-study2-onehot-protein.py		main_linear-study2-onehot-protein.py
requirements.txt		requirements.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Classifying the Stoichiometry of Virus-like Particles with Interpretable Machine Learning

Introduction

Pipeline

Datasets

Requirements

Reproduce Results

Citation

References

About

Releases

Packages

Languages

License

Shef-AIRE/StoicIML

Folders and files

Latest commit

History

Repository files navigation

Classifying the Stoichiometry of Virus-like Particles with Interpretable Machine Learning

Introduction

Pipeline

Datasets

Requirements

Reproduce Results

Citation

References

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages