Our codebase is developed on top of FrameFlow, MultiFlow, PepFlow, and ByProt. If you have any questions, please contact fangwu97@stanford.edu. Thank you! :)
# Install environment with dependencies.
conda env create -f env.yml
# Activate environment
conda activate dflow
# Install local package.
# Current directory should have setup.py.
pip install -e .
pip install easydict lmdb
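Before moving on, you can confirm the editable install worked. Note the import name dflow below is an assumption inferred from the experiment paths later in this README:
# Assumed import name 'dflow' (inferred from dflow/experiments paths); adjust if your package name differs
python -c "import dflow; print(dflow.__file__)"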
Next, install torch-scatter manually to match your torch version. (Unfortunately, torch-scatter cannot be installed as part of the environment file because its wheels are built per torch/CUDA version.) We use torch 2.4.1 and CUDA 12.1 (H100), so we install the following:
pip install torch-scatter -f https://data.pyg.org/whl/torch-2.4.1+cu121.html
If you use a different torch version, you can find it with the following:
# Find your installed version of torch
python
>>> import torch
>>> torch.__version__
# Example: torch 2.4.1+cu121
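Equivalently, you can print the version non-interactively; the version string (e.g., 2.4.1+cu121) maps to the wheel index URL shown above (https://data.pyg.org/whl/torch-2.4.1+cu121.html):
# Print the installed torch version without opening a REPL
python -c "import torch; print(torch.__version__)"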
Warning
You will likely run into the following error from DeepSpeed:
ModuleNotFoundError: No module named 'torch._six'
If so, replace from torch._six import inf with from torch import inf in these files:
/path/to/envs/site-packages/deepspeed/runtime/utils.py
/path/to/envs/site-packages/deepspeed/runtime/zero/stage_1_and_2.py
where /path/to/envs is replaced with your path.
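One way to apply this patch is with sed; a minimal sketch, assuming your environment lives at /path/to/envs (substitute your actual path):
# Replace the torch._six import in both DeepSpeed files (keeps .bak backups)
sed -i.bak 's/from torch\._six import inf/from torch import inf/' \
  /path/to/envs/site-packages/deepspeed/runtime/utils.py \
  /path/to/envs/site-packages/deepspeed/runtime/zero/stage_1_and_2.py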
Pretraining datasets are hosted on Zenodo here. Download the following files:
real_train_set.tar.gz (2.5 GB)
synthetic_train_set.tar.gz (220 MB)
test_set.tar.gz (347 MB)
Next, untar the files
# Uncompress training data
mkdir train_set
tar -xzvf real_train_set.tar.gz -C train_set/
tar -xzvf synthetic_train_set.tar.gz -C train_set/
# Uncompress test data
mkdir test_set
tar -xzvf test_set.tar.gz -C test_set/
The resulting directory structure should look like
<current_dir>
├── train_set
│   ├── processed_pdb
│   │   ├── <subdir>
│   │   │   └── <protein_id>.pkl
│   └── processed_synthetic
│       └── <protein_id>.pkl
├── test_set
│   └── processed
│       ├── <subdir>
│       │   └── <protein_id>.pkl
...
Our experiments read the data using relative paths, so keep this directory structure to avoid bugs.
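As a quick sanity check after extraction, here is a minimal Python sketch (run from <current_dir>; the paths follow the tree above) that counts the processed files:
from pathlib import Path

# Count processed .pkl files in each split; paths follow the tree above.
for split in ["train_set/processed_pdb", "train_set/processed_synthetic", "test_set/processed"]:
    num_files = len(list(Path(split).rglob("*.pkl")))
    print(f"{split}: {num_files} .pkl files")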
The PepMerge dataset is available on Google Drive here. Download the following file:
PepMerge_release.zip (1.2 GB)
PepMerge_release.zip contains filtered data of peptide-receptor pairs collected from PepBDB and QBioLip.
For example, in the folder 1a0n_A, the P chain in the PDB file 1a0n is the peptide. Each sub-folder contains FASTA and PDB files for both the peptide and the receptor. The postfix _merge means the peptide and receptor are in the same PDB file. The binding pocket of the receptor is also provided; our model is trained to generate peptides conditioned on this binding pocket.
When you run the code, it will automatically process the data and produce pep_pocket_train_structure_cache.lmdb and pep_pocket_test_structure_cache.lmdb in the default cache folder (i.e., ../pep_cache/).
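To confirm the caches were built, here is a minimal sketch using the lmdb package installed earlier; the path follows the default cache folder above, and the entry count is just a smoke test:
import lmdb

# Open the training cache read-only and report how many entries it holds.
# If the cache is a single file rather than a directory, add subdir=False.
env = lmdb.open("../pep_cache/pep_pocket_train_structure_cache.lmdb",
                readonly=True, lock=False)
with env.begin() as txn:
    print("entries:", txn.stat()["entries"])
env.close()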
The commands to run co-design training are the following:
# pretrain
python -W ignore dflow/experiments/train_se3_flows.py -cn pdb_codesign
# peptide training (1 GPU)
python -W ignore dflow/experiments/train_pep_flows.py
# DDP peptide training (e.g., 4 GPUs)
torchrun --nproc_per_node=4 dflow/experiments/train_pep_flows.py
We use Hydra to maintain our configs. The training config is found at multiflow/configs/pdb_codesign.yaml.
Most important fields:
experiment.num_devices: Number of GPUs to use for training. Default is 2.
data.sampler.max_batch_size: Maximum batch size. We use dynamic batch sizes depending on data.sampler.max_num_res_squared. Both of these parameters need to be tuned for your GPU memory. Our default settings are set for a 40GB Nvidia RTX card.
data.sampler.max_num_res_squared: See above.
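These fields can be overridden from the command line using Hydra's key=value syntax; the values below are illustrative, not recommendations:
# Illustrative: train on 4 GPUs with a smaller dynamic batch budget
python -W ignore dflow/experiments/train_se3_flows.py -cn pdb_codesign \
    experiment.num_devices=4 \
    data.sampler.max_batch_size=64 \
    data.sampler.max_num_res_squared=400000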
Model weights are provided at this Google Drive link.
Run the following to unpack the weights
unzip model_weights.zip
The following three tasks can be performed.
# Unconditional Co-Design
python -W ignore dflow/experiments/inference_pep.py