Proteins function by folding amino acid sequences into dynamic structural ensembles. Despite the central role of protein dynamics, their complexity and the absence of efficient representation methods have hindered their incorporation into studies of protein function and mutation fitness, particularly in deep learning applications. To address this challenge, we present SeqDance, a protein language model designed to learn representations of protein dynamic properties directly from sequence. SeqDance was pre-trained on dynamic biophysical properties derived from over 30,400 molecular dynamics trajectories and 28,600 normal mode analyses. Our results demonstrate that SeqDance effectively captures local dynamic interactions, co-movement patterns, and global conformational features, even for proteins without homologs in the pre-training set. Furthermore, SeqDance improves predictions of protein fitness landscapes, disorder-to-order transition binding regions, and phase-separating proteins. By learning dynamic properties from sequence, SeqDance complements conventional evolution- and static structure-based methods, providing novel insights into protein behavior and function.
SeqDance was trained using Python (v3.12.2), PyTorch (v2.2.0), and the Transformers library (v4.39.1). For detailed environment setup, please refer to SeqDance_env.yml. For details on the model architecture and pre-training process, please refer to the code in the model directory.
conda env create -f SeqDance_env.yml
conda activate seqdance
cd model
torchrun --nnodes=1 --nproc_per_node=6 train_ddp.py
SeqDance is trained with distributed data parallel (DDP). The detailed hyperparameters are listed in config. Pre-training took ten days on a server with six A6000 GPUs.
We provide the training sequences in the dataset: in the "sequence" column, we use <linker> to separate sequences in a complex; in the "modify_seq" column, we use <eos><cls> instead.
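The two column conventions above can be related with a few lines of Python. This is a minimal sketch, assuming the only difference between the columns is the separator token; whether "modify_seq" also carries leading or trailing special tokens is not stated here, and the helper names and example sequences are hypothetical.

```python
LINKER = "<linker>"      # separator used in the "sequence" column
EOS_CLS = "<eos><cls>"   # separator used in the "modify_seq" column


def split_complex(sequence: str) -> list[str]:
    """Split a "sequence"-column entry into its individual chains."""
    return sequence.split(LINKER)


def to_modify_seq(sequence: str) -> str:
    """Rewrite a "sequence"-column entry with the "modify_seq" separator.

    Assumption: only the separator changes between the two columns.
    """
    return EOS_CLS.join(split_complex(sequence))


# Made-up two-chain complex, for illustration only.
example = "MKTAYIAK<linker>GSHMLEDP"
print(split_complex(example))   # ['MKTAYIAK', 'GSHMLEDP']
print(to_modify_seq(example))   # MKTAYIAK<eos><cls>GSHMLEDP
```

A single-chain entry contains no separator and passes through unchanged.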
If you are interested in using the extracted features (~100 GB), please contact us.
You can download the pre-trained SeqDance weights here: . Follow the instructions in notebook/pretrained_seqdance_attention_embedding.ipynb to extract the attention maps related to pairwise features and to obtain residue-level embeddings. Please note that this demo may take a few minutes to complete.
All pre-training datasets used in SeqDance are publicly available.
Source | Description | Number | Method |
---|---|---|---|
ATLAS | Ordered structures in PDB (no membrane proteins) | 1,516 | All-atom MD, 3x100 ns |
PED | Disordered regions | 382 | Experimental and other methods |
GPCRmd | Membrane proteins | 509 | All-atom MD, 3x500 ns |
IDRome | Disordered regions | 28,058 | Coarse-grained MD, converted to all-atom |
ProteinFlow | Ordered structures in PDB | 28,631 | Normal mode analysis |
IDRome trajectories were converted to all-atom trajectories using cg2all, with the following command:
convert_cg2all -p top_ca.pdb -d traj.xtc -o traj_all.dcd -opdb top_all.pdb --cg CalphaBasedModel
We extracted residue-level and pairwise dynamic features from MD trajectories:
- Residue-level features: Root mean square fluctuation (RMSF), surface area, secondary structure (eight classes), and dihedral angles (phi, psi, chi1).
- Pairwise features: Correlation of Cα movements, and frequencies of hydrogen bonds, salt bridges, Pi-cation, Pi-stacking, T-stacking, hydrophobic, and van der Waals interactions.
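To make two of the features listed above concrete, the sketch below computes RMSF and a normalized Cα cross-correlation matrix with NumPy. It is not the code in MD_features.py: it assumes the frames are already superposed on a common reference, uses the standard dynamic cross-correlation definition (which may differ from the one used for pre-training), and runs on synthetic coordinates.

```python
import numpy as np


def rmsf(coords: np.ndarray) -> np.ndarray:
    """Per-residue RMSF from Calpha coordinates of shape (n_frames, n_residues, 3).

    Assumption: frames are already aligned to a common reference.
    """
    disp = coords - coords.mean(axis=0)              # displacement from the mean position
    return np.sqrt((disp ** 2).sum(axis=2).mean(axis=0))


def ca_correlation(coords: np.ndarray) -> np.ndarray:
    """Normalized cross-correlation of Calpha displacements, shape (n_residues, n_residues)."""
    disp = coords - coords.mean(axis=0)
    # cov[i, j] = time average of the dot product of displacements i and j
    cov = np.einsum("fid,fjd->ij", disp, disp) / coords.shape[0]
    norm = np.sqrt(np.diag(cov))
    return cov / np.outer(norm, norm)


# Synthetic trajectory: 50 frames, 8 residues (illustration only).
rng = np.random.default_rng(0)
traj = rng.normal(size=(50, 8, 3))
print(rmsf(traj).shape, ca_correlation(traj).shape)
```

Values in the correlation matrix range from -1 (anti-correlated motion) to 1 (fully correlated motion), with ones on the diagonal.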
GetContacts was used to extract nine types of interactions from MD trajectories:
cd data_prepare/molecular_dynamics
get_dynamic_contacts.py --itypes hb sb pc ps ts hp vdw --cores 2 --topology 3tvj_I.pdb --trajectory 3tvj_I_10frames.dcd --output 3tvj_I_10frames_contact.tsv
After extracting the interactions, you can use MDTraj v1.9.9 to generate the residue-level and pairwise features:
cd data_prepare/molecular_dynamics
python MD_features.py -p 3tvj_I.pdb -t 3tvj_I_10frames.dcd -i 3tvj_I_10frames_contact.tsv -o 3tvj_I
-p: PDB structure file; -t: MD trajectory file (.dcd format here); -i: interaction tsv file from GetContacts; -o: file name for residue features and pairwise features.
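As an illustration of how interaction frequencies can be derived from a GetContacts table, the sketch below counts, for each interaction type and residue pair, the fraction of frames in which the contact appears. It is not the logic of MD_features.py. It assumes the tsv has tab-separated columns (frame, interaction type, atom_1, atom_2) with atom labels of the form chain:resname:resid:atomname and comment lines starting with "#"; please check this against your own output file. The function name and sample rows are hypothetical.

```python
from collections import Counter


def contact_frequencies(tsv_lines, n_frames):
    """Fraction of frames in which each (interaction type, residue pair) occurs.

    Assumed row layout: frame <tab> itype <tab> atom_1 <tab> atom_2,
    with atoms written as chain:resname:resid:atomname.
    """
    seen = set()
    for line in tsv_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue                                   # skip header and blank lines
        frame, itype, atom1, atom2 = line.split("\t")[:4]
        res1 = ":".join(atom1.split(":")[:3])          # drop the atom name
        res2 = ":".join(atom2.split(":")[:3])
        pair = tuple(sorted((res1, res2)))             # order-independent residue pair
        seen.add((int(frame), itype, pair))            # count each pair once per frame
    counts = Counter((itype, pair) for _, itype, pair in seen)
    return {key: n / n_frames for key, n in counts.items()}


# Made-up rows for a two-frame trajectory.
sample = [
    "# Columns: frame, interaction_type, atom_1, atom_2",
    "0\tsb\tA:GLU:82:OE2\tA:ARG:76:NH2",
    "1\tsb\tA:GLU:82:OE1\tA:ARG:76:NH1",
    "0\thb\tA:SER:10:OG\tA:ASP:14:OD1",
]
for key, freq in sorted(contact_frequencies(sample, n_frames=2).items()):
    print(key, freq)
```

The total number of frames is passed in explicitly because frames without any contact do not appear in the table.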
For NMA data, we used ProDy v2.4.0 to conduct the analysis. Normal modes were categorized into three frequency-based clusters. For each cluster, residue fluctuation and pairwise correlation maps were computed.
cd data_prepare/normal_mode_analysis
python NMA_features.py -i 2g3r.pdb -o nma_residue_pair_features_2g3r
-i: PDB structure file; -o: file name for NMA features.
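The idea of grouping normal modes by frequency and computing fluctuations and correlations per group can be sketched with NumPy alone. This is not NMA_features.py and does not call ProDy: it assumes the modes are split into three equal-sized groups after sorting by eigenvalue (the actual clustering rule is not specified here), and it uses a random positive-definite matrix as a stand-in for a real Hessian, whose zero-frequency rigid-body modes would have to be removed first.

```python
import numpy as np


def grouped_nma_features(eigvals, eigvecs, n_groups=3):
    """Per-group residue fluctuations and correlation maps from normal modes.

    eigvals: (n_modes,) positive eigenvalues (proportional to squared frequency).
    eigvecs: (3 * n_residues, n_modes), one mode per column, rows ordered x1, y1, z1, x2, ...
    Assumption: groups are equal-sized slices of the eigenvalue-sorted modes.
    """
    order = np.argsort(eigvals)                        # low to high frequency
    n_res = eigvecs.shape[0] // 3
    features = []
    for idx in np.array_split(order, n_groups):
        modes = eigvecs[:, idx].reshape(n_res, 3, -1)  # (residue, xyz, mode)
        weighted = modes / np.sqrt(eigvals[idx])       # each mode contributes 1 / eigenvalue
        cov = np.einsum("idm,jdm->ij", weighted, weighted)
        fluct = np.diag(cov).copy()                    # squared fluctuation per residue
        norm = np.sqrt(fluct)
        corr = cov / np.outer(norm, norm)              # normalized pairwise correlation
        features.append((fluct, corr))
    return features


# Stand-in "Hessian" for 5 residues (illustration only, not a real elastic network).
rng = np.random.default_rng(0)
a = rng.normal(size=(15, 15))
hessian = a @ a.T + 15 * np.eye(15)
vals, vecs = np.linalg.eigh(hessian)
for fluct, corr in grouped_nma_features(vals, vecs):
    print(fluct.shape, corr.shape)
```

The low-frequency group typically dominates the fluctuations because each mode is weighted by the inverse of its eigenvalue.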
We recommend installing GetContacts, MDTraj, and ProDy in conda environments separate from the SeqDance pre-training environment. Installing all required packages took about an hour on our server.
Feature extraction is the most complicated step in our work; it took us over a month.