Proteins function by folding amino acid sequences into dynamic structural ensembles. Despite the critical role of protein dynamics, their inherent complexity and the lack of efficient integration methods have hindered their incorporation into deep learning models. To address this, we developed SeqDance and ESMDance, protein language models pre-trained on dynamic biophysical properties derived from molecular dynamics (MD) trajectories of over 35,800 proteins and normal mode analyses (NMA) of over 28,500 proteins.
SeqDance, which operates solely on sequence input, captures both local dynamic interactions and global conformational properties for ordered and disordered proteins, even for proteins without homologs in the pre-training dataset. Changes in the dynamic properties predicted by SeqDance are predictive of mutation effects on protein folding stability. ESMDance, which builds on ESM2 outputs, significantly outperforms ESM2 in zero-shot mutation effect prediction for designed and viral proteins. Together, SeqDance and ESMDance provide novel insights into protein behaviors and mutation effects through the lens of protein dynamics.
- Training sequences and extracted features: Hugging Face
- Pre-trained SeqDance/ESMDance weights (version v2):
  - Hugging Face SeqDance
  - Hugging Face ESMDance
- Code for model usage: notebook
SeqDance and ESMDance both consist of Transformer encoders with dynamic property prediction heads. The Transformer encoder follows the ESM2-35M architecture, with 12 layers and 20 attention heads per layer. Both models take protein sequences as input and predict residue-level and pairwise dynamic properties. The dynamic property prediction heads contain 1.2 million trainable parameters.
- SeqDance: All parameters were randomly initialized, allowing the model to learn dynamic properties from scratch.
- ESMDance: All ESM2-35M parameters were frozen, leveraging evolutionary information from ESM2-35M to predict dynamic properties.
For details, refer to the code in the model directory.
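To make the shapes involved concrete, the following is a minimal NumPy sketch of how residue-level and pairwise prediction heads can sit on top of a 12-layer, 20-head encoder. It is not the repository's implementation: the encoder outputs are random stand-ins, the output sizes are hypothetical, the embedding size of 480 is the ESM2-35M value (not stated above), and reading pairwise properties from the stacked attention maps is an assumption of this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

N_LAYERS, N_HEADS = 12, 20          # from the architecture described above
EMBED_DIM = 480                     # ESM2-35M embedding size (not stated above)
SEQ_LEN = 50                        # hypothetical protein length
N_RES_FEATS, N_PAIR_FEATS = 4, 8    # hypothetical numbers of predicted properties

# Random stand-ins for what a Transformer encoder would output.
embeddings = rng.normal(size=(SEQ_LEN, EMBED_DIM))
attn = rng.random(size=(N_LAYERS * N_HEADS, SEQ_LEN, SEQ_LEN))

# Residue-level head: one linear map applied to every residue embedding.
w_res = rng.normal(size=(EMBED_DIM, N_RES_FEATS)) * 0.01
residue_pred = embeddings @ w_res                 # (SEQ_LEN, N_RES_FEATS)

# Pairwise head (assumed design): symmetrize the attention maps, then apply
# a linear map to the 12 * 20 = 240 attention values of each residue pair.
attn_sym = 0.5 * (attn + attn.transpose(0, 2, 1))
pair_in = np.moveaxis(attn_sym, 0, -1)            # (SEQ_LEN, SEQ_LEN, 240)
w_pair = rng.normal(size=(N_LAYERS * N_HEADS, N_PAIR_FEATS)) * 0.01
pair_pred = pair_in @ w_pair                      # (SEQ_LEN, SEQ_LEN, N_PAIR_FEATS)
```

Symmetrizing the attention maps before the linear map keeps the pairwise predictions symmetric in the two residues, which matches properties such as contact frequencies.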
Download and process the data as described on the Hugging Face dataset page (see "Merging HDF5 Files"), then update the file paths in config.py.
A conda environment with pytorch=2.5.1, transformers=4.48.2, and h5py was sufficient for pre-training in our experiments. Alternatively, create the environment from SeqDance_env.yml:
conda env create -f SeqDance_env.yml
conda activate seqdance
python -m torch.distributed.run --nnodes=1 --nproc_per_node=4 model/train_ddp.py
SeqDance/ESMDance were trained using Distributed Data Parallel (DDP). Training took 10 days on a server with four L40s GPUs. Hyperparameters are listed in config.py.
Predict dynamic properties for wild-type and mutated sequences, calculate their relative changes, and infer mutation effects.
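The three steps above can be sketched in plain Python. This is an illustration only: the property values are made up rather than model output, the log-ratio definition of "relative change" and the mean-absolute-change summary are assumptions of this sketch, and `relative_change` and `mutation_score` are hypothetical helper names, not functions from the repository.

```python
import math

def relative_change(wt, mut, eps=1e-8):
    """Per-position log ratio of mutant to wild-type property values."""
    return [math.log((m + eps) / (w + eps)) for w, m in zip(wt, mut)]

def mutation_score(wt, mut):
    """Hypothetical summary: mean absolute relative change across positions."""
    changes = relative_change(wt, mut)
    return sum(abs(c) for c in changes) / len(changes)

# Made-up per-residue RMSF values for a 5-residue sequence (not model output).
wt_rmsf = [0.8, 1.0, 1.2, 0.9, 1.1]
mut_rmsf = [0.8, 1.5, 1.2, 0.9, 1.1]   # only position 2 changes

score = mutation_score(wt_rmsf, mut_rmsf)      # larger means a bigger predicted change
identical = mutation_score(wt_rmsf, wt_rmsf)   # no change gives a score of zero
```

In practice the two input lists would come from running the model on the wild-type and mutated sequences, and the summary statistic could be replaced by any function of the per-position changes.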
SeqDance embeddings encode rich biophysical properties and outperform ESM2-650M in predicting conformational properties for both ordered and disordered proteins.
SeqDance's attention effectively captures protein dynamic interactions from sequence alone, performing comparably to ESM2-35M while using only 1/1000th of its training sequences.
- Code: Attention analysis
All datasets used in SeqDance are publicly available.
Source | Description | Number | Method |
---|---|---|---|
mdCATH | Ordered structures in PDB | 5,392 | All-atom MD, 5×464 ns |
ATLAS | Ordered structures (no membrane proteins) | 1,516 | All-atom MD, 3×100 ns |
PED | Disordered regions | 382 | Experimental and other methods |
GPCRmd | Membrane proteins | 509 | All-atom MD, 3×500 ns |
IDRome | Disordered regions | 28,058 | Coarse-grained MD, converted to all-atom |
ProteinFlow | Ordered structures in PDB | 28,546 | Normal mode analysis |
Residue-level features: RMSF, surface area, secondary structure (eight classes), dihedral angles (phi, psi, chi1).
Pairwise features: Correlation of Cα movements, and frequencies of hydrogen bonds, salt bridges, Pi-cation, Pi-stacking, T-stacking, hydrophobic, and van der Waals interactions.
- Interaction extraction: GetContacts
cd data_prepare/molecular_dynamics
get_dynamic_contacts.py --itypes hb sb pc ps ts hp vdw --cores 2 --topology 3tvj_I.pdb --trajectory 3tvj_I_10frames.dcd --output 3tvj_I_10frames_contact.tsv
- Feature extraction: MDTraj v1.9.9
python MD_features.py -p 3tvj_I.pdb -t 3tvj_I_10frames.dcd -i 3tvj_I_10frames_contact.tsv -o 3tvj_I
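For readers unfamiliar with two of the features listed above, RMSF and the correlation of Cα movements, here is a minimal NumPy sketch of how they can be computed from an aligned trajectory. It runs on synthetic coordinates and does not reproduce MD_features.py; in particular, the normalized displacement covariance used here is one common definition of motion correlation and is an assumption, since the exact definition used in the repository is not stated here.

```python
import numpy as np

def rmsf(coords):
    """RMSF per atom. coords: (n_frames, n_atoms, 3), already aligned."""
    disp = coords - coords.mean(axis=0)            # displacement from mean position
    return np.sqrt((disp ** 2).sum(axis=2).mean(axis=0))

def ca_correlation(coords):
    """Normalized covariance of displacement vectors, shape (n_atoms, n_atoms)."""
    disp = coords - coords.mean(axis=0)
    cov = np.einsum("fid,fjd->ij", disp, disp) / coords.shape[0]
    norm = np.sqrt(np.diag(cov))
    return cov / np.outer(norm, norm)

# Synthetic trajectory: 10 frames, 6 C-alpha atoms (not real MD data).
rng = np.random.default_rng(0)
traj = rng.normal(size=(10, 6, 3))
traj[:, 1] = traj[:, 0] + 5.0                      # atom 1 moves rigidly with atom 0

fluct = rmsf(traj)                                 # (6,)
corr = ca_correlation(traj)                        # (6, 6), values in [-1, 1]
```

Atoms 0 and 1 move together in this toy trajectory, so their correlation is 1 and their RMSF values match, which is a quick sanity check for both functions.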
NMA was performed using ProDy v2.4.0. Normal modes were categorized into three frequency-based clusters, and residue fluctuation and pairwise correlation maps were computed.
cd data_prepare/normal_mode_analysis
python NMA_features.py -i 2g3r.pdb -o nma_residue_pair_features_2g3r
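The idea of grouping normal modes by frequency and computing fluctuation and correlation maps per group can be illustrated with a small NumPy-only example. This sketch does not use ProDy and does not reproduce NMA_features.py: it assumes a Gaussian network model with a 10 Å cutoff, splits the modes into three equally sized groups, and runs on synthetic coordinates. The elastic network model and the clustering rule actually used are not specified above, so both are assumptions, and `gnm_modes` and `features_per_cluster` are hypothetical names.

```python
import numpy as np

def gnm_modes(ca_coords, cutoff=10.0):
    """Eigen-decompose a Gaussian network model built from C-alpha coordinates."""
    dist = np.linalg.norm(ca_coords[:, None] - ca_coords[None, :], axis=-1)
    kirchhoff = -(dist < cutoff).astype(float)     # -1 for every contact
    np.fill_diagonal(kirchhoff, 0.0)
    np.fill_diagonal(kirchhoff, -kirchhoff.sum(axis=1))  # diagonal = contact count
    vals, vecs = np.linalg.eigh(kirchhoff)         # eigenvalues in ascending order
    return vals[1:], vecs[:, 1:]                   # drop the single zero mode

def features_per_cluster(vals, vecs, n_clusters=3):
    """Return (fluctuation, correlation) for each frequency group of modes."""
    out = []
    for idx in np.array_split(np.arange(len(vals)), n_clusters):
        cov = (vecs[:, idx] / vals[idx]) @ vecs[:, idx].T  # sum of v v^T / lambda
        fluct = np.diag(cov)
        corr = cov / np.sqrt(np.outer(fluct, fluct))
        out.append((fluct, corr))
    return out

# Synthetic 30-residue chain with 3.8 Angstrom steps (not a real structure).
rng = np.random.default_rng(0)
steps = rng.normal(size=(30, 3))
steps *= 3.8 / np.linalg.norm(steps, axis=1, keepdims=True)
coords = np.cumsum(steps, axis=0)

vals, vecs = gnm_modes(coords)
clusters = features_per_cluster(vals, vecs)        # low, medium, high frequency
```

Low-frequency modes describe collective, large-amplitude motions, while high-frequency modes describe localized ones, so separating them yields feature maps at different scales of motion.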
We recommend installing GetContacts, MDTraj, and ProDy in separate conda environments. The feature extraction process took over a month.
SeqDance: A Protein Language Model for Representing Protein Dynamic Properties
Chao Hou, Yufeng Shen
bioRxiv 2024.10.11.617911; doi: https://doi.org/10.1101/2024.10.11.617911