Our codebase is developed on top of FrameFlow, MultiFlow, PepFlow, and ByProt. If you have any questions, please contact fangwu97@stanford.edu. Thank you! :)
# Install environment with dependencies.
conda env create -f env.yml
# Activate environment
conda activate dflow
# Install local package.
# Current directory should have setup.py.
pip install -e .
pip install easydict lmdb
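Before moving on, you can confirm the editable install worked. Note the import name dflow below is an assumption inferred from the experiment paths later in this README:
# Assumed import name 'dflow' (inferred from dflow/experiments paths); adjust if your package name differs
python -c "import dflow; print(dflow.__file__)"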
Next, install torch-scatter manually to match your torch version. (Unfortunately, torch-scatter cannot be installed as part of the environment file because its wheels are built per torch/CUDA version.) We use torch 2.4.1 and CUDA 12.1 (H100), so we install the following:
pip install torch-scatter -f https://data.pyg.org/whl/torch-2.4.1+cu121.html
If you use a different torch version, you can find it with the following:
# Find your installed version of torch
python
>>> import torch
>>> torch.__version__
# Example: torch 2.4.1+cu121
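Equivalently, you can print the version non-interactively; the version string (e.g., 2.4.1+cu121) maps to the wheel index URL shown above (https://data.pyg.org/whl/torch-2.4.1+cu121.html):
# Print the installed torch version without opening a REPL
python -c "import torch; print(torch.__version__)"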
Warning
You will likely run into the following error from DeepSpeed:
ModuleNotFoundError: No module named 'torch._six'
If so, replace from torch._six import inf with from torch import inf in these files:
/path/to/envs/site-packages/deepspeed/runtime/utils.py
/path/to/envs/site-packages/deepspeed/runtime/zero/stage_1_and_2.py
where /path/to/envs is replaced with your path.
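One way to apply this patch is with sed; a minimal sketch, assuming your environment lives at /path/to/envs (substitute your actual path):
# Replace the torch._six import in both DeepSpeed files (keeps .bak backups)
sed -i.bak 's/from torch\._six import inf/from torch import inf/' \
  /path/to/envs/site-packages/deepspeed/runtime/utils.py \
  /path/to/envs/site-packages/deepspeed/runtime/zero/stage_1_and_2.py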
Pretraining datasets are hosted on Zenodo here. Download the following files:
real_train_set.tar.gz (2.5 GB)
synthetic_train_set.tar.gz (220 MB)
test_set.tar.gz (347 MB)
Next, untar the files
# Uncompress training data
mkdir train_set
tar -xzvf real_train_set.tar.gz -C train_set/
tar -xzvf synthetic_train_set.tar.gz -C train_set/
# Uncompress test data
mkdir test_set
tar -xzvf test_set.tar.gz -C test_set/
The resulting directory structure should look like
<current_dir>
├── train_set
│   ├── processed_pdb
│   │   ├── <subdir>
│   │   │   └── <protein_id>.pkl
│   └── processed_synthetic
│       └── <protein_id>.pkl
├── test_set
│   └── processed
│       ├── <subdir>
│       │   └── <protein_id>.pkl
...
Our experiments read the data using relative paths, so keep this directory structure to avoid bugs.
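As a quick sanity check after extraction, here is a minimal Python sketch (run from <current_dir>; the paths follow the tree above) that counts the processed files:
from pathlib import Path

# Count processed .pkl files in each split; paths follow the tree above.
for split in ["train_set/processed_pdb", "train_set/processed_synthetic", "test_set/processed"]:
    num_files = len(list(Path(split).rglob("*.pkl")))
    print(f"{split}: {num_files} .pkl files")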
The PepMerge dataset is available on Google Drive here. Download the following file:
PepMerge_release.zip (1.2 GB)
PepMerge_release.zip contains filtered data of peptide-receptor pairs collected from PepBDB and QBioLip.
For example, in the folder 1a0n_A, the P chain in the PDB file 1a0n is the peptide. Each sub-folder contains FASTA and PDB files for both the peptide and the receptor. The postfix _merge means the peptide and receptor are in the same PDB file. The binding pocket of the receptor is also provided; our model is trained to generate peptides conditioned on this binding pocket.
When you run the code, it will automatically process the data and produce pep_pocket_train_structure_cache.lmdb and pep_pocket_test_structure_cache.lmdb in the default cache folder (i.e., ../pep_cache/).
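To confirm the caches were built, here is a minimal sketch using the lmdb package installed earlier; the path follows the default cache folder above, and the entry count is just a smoke test:
import lmdb

# Open the training cache read-only and report how many entries it holds.
# If the cache is a single file rather than a directory, add subdir=False.
env = lmdb.open("../pep_cache/pep_pocket_train_structure_cache.lmdb",
                readonly=True, lock=False)
with env.begin() as txn:
    print("entries:", txn.stat()["entries"])
env.close()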
The commands to run co-design training are the following:
# pretrain
python -W ignore dflow/experiments/train_se3_flows.py -cn pdb_codesign
# peptide training (1 GPU)
python -W ignore dflow/experiments/train_pep_flows.py
# DDP peptide training (e.g., 4 GPUs)
torchrun --nproc_per_node=4 dflow/experiments/train_pep_flows.py
We use Hydra to maintain our configs. The training config is found at multiflow/configs/pdb_codesign.yaml.
Most important fields:
experiment.num_devices: Number of GPUs to use for training. Default is 2.
data.sampler.max_batch_size: Maximum batch size. We use dynamic batch sizes depending on data.sampler.max_num_res_squared. Both of these parameters need to be tuned for your GPU memory. Our default settings are set for a 40GB Nvidia RTX card.
data.sampler.max_num_res_squared: See above.
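These fields can be overridden from the command line using Hydra's key=value syntax; the values below are illustrative, not recommendations:
# Illustrative: train on 4 GPUs with a smaller dynamic batch budget
python -W ignore dflow/experiments/train_se3_flows.py -cn pdb_codesign \
    experiment.num_devices=4 \
    data.sampler.max_batch_size=64 \
    data.sampler.max_num_res_squared=400000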
Model weights are provided at this Google Drive link.
Run the following to unpack the weights
unzip model_weights.zip
The following three tasks can be performed.
# Unconditional Co-Design
python -W ignore dflow/experiments/inference_pep.py