SeqDance and ESMDance: Protein Language Models Trained on Protein Dynamic Properties

Abstract

Proteins function by folding amino acid sequences into dynamic structural ensembles. Despite the critical role of protein dynamics, their inherent complexity and the lack of efficient integration methods have hindered their incorporation into deep learning models. To address this, we developed SeqDance and ESMDance, protein language models pre-trained on dynamic biophysical properties derived from molecular dynamics (MD) trajectories of over 35,800 proteins and normal mode analyses (NMA) of over 28,500 proteins.

SeqDance, which operates solely on sequence input, captures both local dynamic interactions and global conformational properties for ordered and disordered proteins, even for proteins without homologs in the pre-training dataset. Changes in the dynamic properties predicted by SeqDance are informative of mutation effects on protein folding stability. ESMDance, which utilizes ESM2 outputs, significantly outperforms ESM2 in zero-shot mutation effect prediction for designed and viral proteins. Together, SeqDance and ESMDance provide novel insights into protein behaviors and mutation effects through the lens of protein dynamics.

Data and Model Weights

Training sequences and extracted features: Hugging Face
Pre-trained SeqDance/ESMDance weights:
- (Version v2)
- Hugging Face SeqDance
- Hugging Face ESMDance
Code for model usage: notebook

SeqDance/ESMDance Pre-training

SeqDance and ESMDance both consist of Transformer encoders with dynamic property prediction heads. The Transformer encoder follows the ESM2-35M architecture, with 12 layers and 20 attention heads per layer. Both models take protein sequences as input and predict residue-level and pairwise dynamic properties. The dynamic property prediction heads contain 1.2 million trainable parameters.

SeqDance: All parameters were randomly initialized, allowing the model to learn dynamic properties from scratch.
ESMDance: All ESM2-35M parameters were frozen, leveraging evolutionary information from ESM2-35M to predict dynamic properties.

For details, refer to the code in the model directory.
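To put the encoder and head sizes above in perspective, here is a back-of-the-envelope parameter count. The layer and head counts come from the text; the hidden size (480), feed-forward size (1920), and vocabulary size (33) are assumptions taken from the public ESM2-35M configuration, not from this README, and the language-model head is ignored. Treat it as a rough estimate, not the exact count of the released checkpoints.

```python
# Rough parameter count for an ESM2-35M-style Transformer encoder.
# ASSUMED (not stated in this README): hidden size 480, feed-forward size
# 1920, vocabulary of 33 tokens, rotary position embeddings (so no learned
# position table), and two LayerNorms per layer.

def encoder_layer_params(hidden: int, ffn: int) -> int:
    attention = 4 * (hidden * hidden + hidden)  # Q, K, V, output projections (+ biases)
    feed_forward = (hidden * ffn + ffn) + (ffn * hidden + hidden)
    layer_norms = 2 * (2 * hidden)              # weight and bias, two norms
    return attention + feed_forward + layer_norms


def encoder_params(layers: int, hidden: int, ffn: int, vocab: int) -> int:
    embeddings = vocab * hidden
    final_norm = 2 * hidden
    return embeddings + layers * encoder_layer_params(hidden, ffn) + final_norm


LAYERS, HEADS = 12, 20               # from the description above
HIDDEN, FFN, VOCAB = 480, 1920, 33   # assumed values
HEAD_PARAMS = 1_200_000              # dynamic property heads, from the text

head_dim = HIDDEN // HEADS
total = encoder_params(LAYERS, HIDDEN, FFN, VOCAB)
head_share = HEAD_PARAMS / (total + HEAD_PARAMS)

print(f"per-head dimension: {head_dim}")
print(f"encoder parameters: {total:,}")
print(f"share of parameters in prediction heads: {head_share:.1%}")
```

Under these assumptions the prediction heads are a small fraction of the model, which is why freezing the encoder (ESMDance) leaves only a modest number of trainable parameters.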

Step 1: Data Preparation

Download and process the data as described on Hugging Face (Merging HDF5 Files), then update the file paths in config.py.

Step 2: Environment Setup & Training

A conda environment with pytorch=2.5.1, transformers=4.48.2, and h5py was sufficient for pre-training in our experiments. Alternatively, create the environment from SeqDance_env.yml:

conda env create -f SeqDance_env.yml
conda activate seqdance
python -m torch.distributed.run --nnodes=1 --nproc_per_node=4 model/train_ddp.py

SeqDance and ESMDance were trained using PyTorch Distributed Data Parallel (DDP). Training took 10 days on a server with four NVIDIA L40S GPUs. Hyperparameters are listed in config.py.
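For readers new to DDP, the launcher used above (torch.distributed.run) starts one process per GPU and passes each process its identity through the environment variables RANK, LOCAL_RANK, and WORLD_SIZE. The sketch below only illustrates that mechanism in plain Python; the helper names ddp_context and shard_indices are hypothetical and are not taken from model/train_ddp.py.

```python
import os


def ddp_context(env=os.environ):
    """Read the process identity exported by torch.distributed.run.

    The defaults describe a single-process run, so the same script
    also works when started with plain `python`.
    """
    return {
        "rank": int(env.get("RANK", 0)),              # global process index
        "local_rank": int(env.get("LOCAL_RANK", 0)),  # GPU index on this node
        "world_size": int(env.get("WORLD_SIZE", 1)),  # total number of processes
    }


def shard_indices(n_samples, rank, world_size):
    """Strided split: each process gets a disjoint subset of the samples."""
    return list(range(rank, n_samples, world_size))


# Simulate the third of four processes (--nnodes=1 --nproc_per_node=4).
ctx = ddp_context({"RANK": "2", "LOCAL_RANK": "2", "WORLD_SIZE": "4"})
shard = shard_indices(10, ctx["rank"], ctx["world_size"])
print(ctx, shard)
```

In a real PyTorch training script this splitting is normally handled by a distributed sampler, and gradients are averaged across processes after each backward pass.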

SeqDance/ESMDance Usage

Zero-shot Prediction of Mutation Effects

Predict dynamic properties for wild-type and mutated sequences, calculate their relative changes, and infer mutation effects.

Code: Zero-shot mutation prediction
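The three steps above can be sketched in a few lines. This is an illustration only: toy_predict is a hypothetical stand-in for a SeqDance/ESMDance forward pass, and the scoring rule (negative mean absolute relative change) is an assumption made for the example, not necessarily the rule used in the notebook.

```python
def apply_mutation(sequence: str, mutation: str) -> str:
    """Apply a single substitution written like 'A4G' (1-based position)."""
    wild_aa, position, mutant_aa = mutation[0], int(mutation[1:-1]), mutation[-1]
    if sequence[position - 1] != wild_aa:
        raise ValueError(f"{mutation}: expected {wild_aa} at position {position}")
    return sequence[:position - 1] + mutant_aa + sequence[position:]


def relative_change(wt_values, mut_values, eps=1e-8):
    """Element-wise (mutant - wild type) / |wild type|."""
    return [(m - w) / (abs(w) + eps) for w, m in zip(wt_values, mut_values)]


def mutation_score(wt_values, mut_values):
    """Hypothetical score: larger predicted changes give a more negative score."""
    changes = relative_change(wt_values, mut_values)
    return -sum(abs(c) for c in changes) / len(changes)


def toy_predict(sequence):
    """Hypothetical placeholder for the model: one number per residue."""
    return [(ord(aa) - 64) / 26 for aa in sequence]


wild_type = "MKTAYIAK"
mutant = apply_mutation(wild_type, "A4G")
score = mutation_score(toy_predict(wild_type), toy_predict(mutant))
print(mutant, round(score, 3))
```

With the real models, the per-residue list would be replaced by the predicted residue-level and pairwise dynamic properties of the wild-type and mutated sequences.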

Applications of SeqDance Embeddings

SeqDance embeddings encode rich biophysical properties and outperform ESM2-650M in predicting conformational properties for both ordered and disordered proteins.

Code: Embedding conformational property analysis

SeqDance's Attention

SeqDance's attention effectively captures protein dynamic interactions from sequence alone, performing comparably to ESM2-35M while using only 1/1000th of its training sequences.

Code: Attention analysis

Protein Dynamic Dataset

All datasets used in SeqDance are publicly available.

Source	Description	Number of proteins	Method
mdCATH	Ordered structures in PDB	5,392	All-atom MD, 5×464 ns
ATLAS	Ordered structures (no membrane proteins)	1,516	All-atom MD, 3×100 ns
PED	Disordered regions	382	Experimental and other methods
GPCRmd	Membrane proteins	509	All-atom MD, 3×500 ns
IDRome	Disordered regions	28,058	Coarse-grained MD, converted to all-atom
ProteinFlow	Ordered structures in PDB	28,546	Normal mode analysis

Feature Extraction

MD Trajectory Feature Extraction

Residue-level features: RMSF, surface area, secondary structure (eight classes), dihedral angles (phi, psi, chi1).
Pairwise features: Correlation of Cα movements, and frequencies of hydrogen bonds, salt bridges, Pi-cation, Pi-stacking, T-stacking, hydrophobic, and van der Waals interactions.
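To make two of these features concrete, the sketch below computes RMSF and a normalized Cα displacement correlation with NumPy. It assumes the trajectory has already been superposed onto a reference and reduced to Cα coordinates; the exact definitions used in MD_features.py may differ, so treat the formulas as common textbook forms, not as the repository's implementation.

```python
import numpy as np


def rmsf(coords):
    """Root-mean-square fluctuation per residue.

    coords: array of shape (n_frames, n_residues, 3), already aligned.
    """
    displacement = coords - coords.mean(axis=0)
    return np.sqrt((displacement ** 2).sum(axis=2).mean(axis=0))


def ca_correlation(coords):
    """Normalized covariance of C-alpha displacements (values in [-1, 1])."""
    displacement = coords - coords.mean(axis=0)
    # Average over frames of the dot product between displacement vectors.
    cov = np.einsum("fid,fjd->ij", displacement, displacement) / coords.shape[0]
    norm = np.sqrt(np.diag(cov))
    return cov / np.outer(norm, norm)


# Toy trajectory: 50 frames, 5 residues; residue 1 copies residue 0's motion.
rng = np.random.default_rng(0)
base = rng.normal(size=(1, 5, 3))
noise = rng.normal(scale=0.1, size=(50, 5, 3))
noise[:, 1] = noise[:, 0]
trajectory = base + noise

fluctuation = rmsf(trajectory)
correlation = ca_correlation(trajectory)
print(fluctuation.round(3))
print(correlation.round(2))
```

In the toy trajectory, residues 0 and 1 move together, so their correlation is 1 and their RMSF values are identical.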

Interaction extraction: GetContacts

cd data_prepare/molecular_dynamics
get_dynamic_contacts.py --itypes hb sb pc ps ts hp vdw --cores 2 --topology 3tvj_I.pdb --trajectory 3tvj_I_10frames.dcd --output 3tvj_I_10frames_contact.tsv
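The interaction frequencies listed above can be derived from a per-frame contact table like the one this command writes. The sketch below assumes a tab-separated layout of frame, interaction type, atom 1, atom 2, with atoms labeled chain:resname:resid:atomname and comment lines starting with "#"; check this against your own output file before relying on it. The function name contact_frequencies is hypothetical.

```python
from collections import Counter


def contact_frequencies(lines, n_frames):
    """Fraction of frames in which each residue pair shows each interaction."""
    seen = set()  # (frame, interaction type, residue pair), counted once per frame
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        frame, itype, atom1, atom2 = line.rstrip("\n").split("\t")[:4]
        res1 = ":".join(atom1.split(":")[:3])  # drop the atom name
        res2 = ":".join(atom2.split(":")[:3])
        pair = tuple(sorted((res1, res2)))     # make the pair order-independent
        seen.add((int(frame), itype, pair))
    counts = Counter((itype, pair) for _, itype, pair in seen)
    return {key: n / n_frames for key, n in counts.items()}


# Made-up example rows in the assumed layout.
example = [
    "# example header (layout assumed)",
    "0\tsb\tA:GLU:10:OE1\tA:LYS:14:NZ",
    "0\tsb\tA:GLU:10:OE2\tA:LYS:14:NZ",
    "1\tsb\tA:LYS:14:NZ\tA:GLU:10:OE1",
    "2\thbss\tA:SER:3:OG\tA:ASP:7:OD1",
]
frequencies = contact_frequencies(example, n_frames=4)
for key, value in sorted(frequencies.items()):
    print(key, value)
```

Several atom-level contacts between the same two residues in one frame count once, so a frequency never exceeds 1.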

Feature extraction: MDTraj v1.9.9

python MD_features.py -p 3tvj_I.pdb -t 3tvj_I_10frames.dcd -i 3tvj_I_10frames_contact.tsv -o 3tvj_I

Normal Mode Analysis (NMA) Feature Extraction

NMA was performed using ProDy v2.4.0. Normal modes were categorized into three frequency-based clusters, and residue fluctuation and pairwise correlation maps were computed.

cd data_prepare/normal_mode_analysis
python NMA_features.py -i 2g3r.pdb -o nma_residue_pair_features_2g3r
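For intuition about what the NMA features contain, here is a minimal NumPy sketch of a Gaussian-network-style elastic model. It is a stand-in and not the ProDy pipeline used here: the 10 Å cutoff, the use of a Kirchhoff (connectivity) matrix, and the split of modes into three equal-sized groups ordered by frequency are all assumptions, since the README does not state how the clusters are defined.

```python
import numpy as np


def kirchhoff(coords, cutoff=10.0):
    """Connectivity matrix: -1 for residue pairs within the cutoff, degree on the diagonal."""
    dist = np.linalg.norm(coords[:, None] - coords[None, :], axis=-1)
    contact = (dist < cutoff) & ~np.eye(len(coords), dtype=bool)
    matrix = -contact.astype(float)
    np.fill_diagonal(matrix, contact.sum(axis=1))
    return matrix


def mode_group_features(coords, n_groups=3):
    """Per-group residue fluctuations and pairwise correlations."""
    evals, evecs = np.linalg.eigh(kirchhoff(coords))
    evals, evecs = evals[1:], evecs[:, 1:]  # drop the zero mode (connected network assumed)
    features = []
    for idx in np.array_split(np.arange(len(evals)), n_groups):
        # Covariance contributed by this group: sum_k v_k v_k^T / lambda_k
        cov = (evecs[:, idx] / evals[idx]) @ evecs[:, idx].T
        fluct = np.diag(cov)
        corr = cov / np.sqrt(np.outer(fluct, fluct))
        features.append((fluct, corr))
    return features


# Toy C-alpha trace: an idealized helix with 20 residues.
i = np.arange(20)
angle = np.deg2rad(100.0) * i
helix = np.stack([2.3 * np.cos(angle), 2.3 * np.sin(angle), 1.5 * i], axis=1)

nma_features = mode_group_features(helix)
for fluct, corr in nma_features:
    print(fluct.round(3)[:5], corr.shape)
```

The first group holds the lowest-frequency modes, which contribute the largest fluctuations because each mode is weighted by the inverse of its eigenvalue.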

We recommend installing GetContacts, MDTraj, and ProDy in separate conda environments. The feature extraction process took over a month.

Citation

SeqDance: A Protein Language Model for Representing Protein Dynamic Properties
Chao Hou, Yufeng Shen
bioRxiv 2024.10.11.617911; doi: https://doi.org/10.1101/2024.10.11.617911

