
Conditioning diffusion models for scaffolding protein motifs. A repository for my master's dissertation.


matsagad/mres-project


Diffusion Posterior Sampling via SMC for Zero-Shot Scaffolding of Protein Motifs

This repository provides an interface for solving motif scaffolding problems with an unconditional diffusion model as a prior. It defines inverse problems for a variety of tasks and solves them by sampling the posterior via sequential Monte Carlo. This permits conditional sampling without any additional training of the chosen unconditional model.

The following are the supported tasks, samplers, and likelihood formalisations for conditioning on a motif.

Motif Scaffolding Tasks

  • (Single-) motif scaffolding
  • Multi-motif scaffolding
  • Symmetric motif scaffolding
  • Inpainting/scaffolding with degrees of freedom

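(Diffusion) Posterior Sampling Methods

  • Bootstrap particle filter (BPF)
  • Filtering posterior sampling (FPS-SMC)
  • SMCDiff
  • Twisted diffusion sampler (TDS)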

Motif Scaffolding Likelihood Formalisations

  • Masking (as done by previous methods)
    • Condition on a partial view of the backbone with a fixed orientation.
  • Distance
    • Condition on the pairwise residue distances. This likelihood is reflection-invariant, so it cannot distinguish a motif from its mirror image.
  • Frame-based distance
    • Condition on the pairwise residue distances and the pairwise residue rotation matrix deviations from the rigid body frame representation.
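
To illustrate why a purely distance-based likelihood is reflection-invariant, here is a minimal sketch using only the Python standard library. The coordinates and the helper names (`pairwise_distances`, `reflect_x`, `signed_volume`) are made up for illustration and are not taken from this repository.

```python
import math

def pairwise_distances(coords):
    """Return the full matrix of Euclidean distances between 3D points."""
    return [[math.dist(a, b) for b in coords] for a in coords]

def reflect_x(coords):
    """Mirror every point through the yz-plane."""
    return [(-x, y, z) for x, y, z in coords]

def signed_volume(p0, p1, p2, p3):
    """Scalar triple product of the edges from p0; its sign encodes handedness."""
    a = [p1[i] - p0[i] for i in range(3)]
    b = [p2[i] - p0[i] for i in range(3)]
    c = [p3[i] - p0[i] for i in range(3)]
    return (
        a[0] * (b[1] * c[2] - b[2] * c[1])
        - a[1] * (b[0] * c[2] - b[2] * c[0])
        + a[2] * (b[0] * c[1] - b[1] * c[0])
    )

# Toy "motif": four made-up backbone positions (hypothetical data).
motif = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (5.0, 3.6, 0.0), (6.1, 4.0, 3.5)]
mirrored = reflect_x(motif)

d_motif = pairwise_distances(motif)
d_mirror = pairwise_distances(mirrored)

# The distance matrices agree entry by entry ...
same_distances = all(
    math.isclose(d_motif[i][j], d_mirror[i][j])
    for i in range(len(motif))
    for j in range(len(motif))
)

# ... even though the two point sets have opposite handedness.
opposite_handedness = signed_volume(*motif) * signed_volume(*mirrored) < 0
```

Both `same_distances` and `opposite_handedness` evaluate to `True` here, which is the ambiguity that orientation-aware terms, such as the rotation deviations in the frame-based distance likelihood, are meant to resolve.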

Additionally, other unconditional models can be supported by creating an adapter for them in model, registering them and their parameters, and adding a config in config/model. Currently, we have Genie-SCOPe-128 and Genie-SCOPe-256 available. The conditional samplers assume the frame representation of the protein, so some extra engineering may be required for other models.
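
As a rough sketch of the adapter-plus-registration pattern described above, the following shows one way such a registry could look. Every name here (`MODEL_REGISTRY`, `register_model`, `DiffusionModel`, `ToyAdapter`, and the `sample` signature) is hypothetical; the actual abstract class in model/diffusion.py and the registry in utils/registry.py may look quite different.

```python
from abc import ABC, abstractmethod

# Hypothetical registry mapping a model name to its adapter class.
MODEL_REGISTRY = {}

def register_model(name):
    """Class decorator that records an adapter under the given name."""
    def decorator(cls):
        if name in MODEL_REGISTRY:
            raise ValueError(f"model '{name}' is already registered")
        MODEL_REGISTRY[name] = cls
        return cls
    return decorator

class DiffusionModel(ABC):
    """Hypothetical minimal interface an adapter would have to implement."""

    @abstractmethod
    def sample(self, n_samples, length):
        """Return n_samples backbones, each with `length` residues."""

@register_model("toy-model")
class ToyAdapter(DiffusionModel):
    """Stand-in adapter that returns all-zero coordinates instead of real samples."""

    def sample(self, n_samples, length):
        return [[(0.0, 0.0, 0.0)] * length for _ in range(n_samples)]

# Look the adapter up by name, as a config-driven entry point might do.
model = MODEL_REGISTRY["toy-model"]()
samples = model.sample(n_samples=2, length=4)
```

Registering by name keeps the entry point independent of any particular model: a config only needs to carry the string key and the model's parameters.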

Installation

To match our setup, use Python 3.9 with CUDA version 11.8 or above. First, install the requirements with pip.

pip install -r requirements.txt

Then, initialise the submodules if they are not already set up.

git submodule update --init

Finally, and optionally, we use the in silico design pipeline from AQLaboratory for evaluation. Run their bash scripts for installing TMScore, ProteinMPNN, and ESMFold to set up the self-consistency pipeline.

Structure

.
├── conditional/                    # Posterior samplers
│   ├── __init__.py                     # Sampler registration
│   ├── components/                     # Reusable components
│   │   ├── observation_generator.py        # for generating y sequence
│   │   └── particle_filter.py              # for filtering
│   ├── bpf.py                          # Bootstrap particle filter
│   ├── fpssmc.py                       # Filtering posterior sampling
│   ├── smcdiff.py                      # SMCDiff
│   ├── tds.py                          # Twisted diffusion sampler
│   └── wrapper.py                      # Abstract class for samplers
├── config/                         # Configs
│   ├── config.yaml                     # Main config file
│   ├── experiments/                    # Experiment config groups
│   │   └── ...
│   └── model/                          # Model config groups
│       └── ...
├── data/                           # Motif scaffolding data   
│   ├── motif_problems/                 # RFDiffusion benchmark
│   │   └── ...
│   ├── multi_motif_problems/           # Genie2 benchmark
│   │   └── ...
│   └── symmetric_motif_problems/       # RFDiffusion trimeric covid binder
│       └── ...
├── experiments/                    # Experiments
│   ├── __init__.py                     # Experiment registration
│   └── experiments.py                  # Experiment definitions
├── model/                          # Diffusion model
│   ├── __init__.py                     # Model registration
│   ├── diffusion.py                    # Abstract class for diffusion models
│   └── genie.py                        # Genie adapter
├── multirun/                       # Output of multirun/sweeping experiments
│   └── ...
├── outputs/                        # Output of experiments
│   └── ...
├── protein/                        # Protein-related functions
│   └── frames.py                       # Abstract class for frames
├── scripts/                        # Scripts for config generation, etc.
│   └── ...
├── submodules/                     # Git submodules
│   └── ...
├── utils/                          # Utility functions
│   ├── path.py                         # for resolving paths
│   ├── pdb.py                          # for working with PDBs
│   ├── registry.py                     # for handling registrations
│   ├── resampling.py                   # for low-variance resampling
│   └── symmetry.py                     # for dealing with symmetry
└── main.py                         # Main entry point
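
The tree above lists utils/resampling.py as handling low-variance resampling for the particle filters. As a rough illustration of what such a step does, here is a sketch of systematic resampling, one common low-variance scheme. It is not taken from the repository, and the function name and signature are assumptions.

```python
import random

def systematic_resample(weights, rng=random):
    """Return particle indices drawn by systematic (low-variance) resampling.

    A single uniform offset places n evenly spaced pointers on [0, 1); each
    pointer selects the particle whose cumulative-weight interval contains it.
    """
    n = len(weights)
    total = sum(weights)
    u0 = rng.random() / n  # one random offset shared by all pointers
    indices = []
    i = 0
    cumulative = weights[0] / total
    for k in range(n):
        u = u0 + k / n
        # Advance to the particle whose interval contains the pointer u.
        while u >= cumulative and i < n - 1:
            i += 1
            cumulative += weights[i] / total
        indices.append(i)
    return indices

# Toy weights (made up): the third particle holds most of the mass.
weights = [0.1, 0.1, 0.7, 0.1]
indices = systematic_resample(weights, random.Random(0))
```

Because the pointers are evenly spaced, a particle with normalised weight w is copied either floor(n·w) or ceil(n·w) times, which gives lower variance than drawing n independent multinomial samples.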

Usage

The project uses the Hydra framework for handling different experimental setups. Configuration files and groups are defined under the config folder.

To get started, run the following command to list the available options for each config group (e.g. the experiment type) and the default parameters.

python3 main.py --help

Supported Experiments

Experiments are selected with the option experiment={experiment_name}. Check their arguments and defaults in config/experiments.

| Experiment | Description |
| --- | --- |
| sample_unconditional | Draw unconditional samples from the diffusion model. The total length must be specified. |
| sample_given_motif | Sample conditioned on a motif being present in the samples. Motif config files follow the RFDiffusion specification. |
| sample_given_multiple_motifs | Sample conditioned on multiple motifs being present in the samples. Motif config files follow the Genie2 specification. |
| sample_given_symmetry | Sample conditioned on the samples following a point symmetry. |
| sample_given_motif_and_symmetry | Sample conditioned on a motif being present in the samples and on the samples following a point symmetry. The motif specification is for a single monomer. |
| evaluate_samples | Evaluate motif scaffolding results using the in silico design pipeline. |

Examples

Sample 16 proteins with 96 residues each using unconditional model Genie-SCOPe-128 (default if unspecified) on GPU device #1.

python3 main.py experiment=sample_unconditional \
    experiment.n_samples=16 \
    experiment.sample_length=96 \
    model=genie-scope-128 \
    model.device=cuda:1

Scaffold motif problem 3IXT using TDS with masking likelihood, twist scale=2.0, and K=8 particles.

python3 main.py experiment=sample_given_motif \
    experiment/motif=3IXT \
    experiment/conditional_method=tds-mask \
    experiment.conditional_method.twist_scale=2.0 \
    experiment.n_samples=8

Produce 16 scaffolds for motif problem 1PRW using TDS with distance likelihood and K=8 particles.

python3 main.py experiment=sample_given_motif \
    experiment/motif=1PRW \
    experiment/conditional_method=tds-distance \
    experiment.conditional_method.n_batches=16 \
    experiment.n_samples=128

Scaffold motif problem 5TPN, allowing the motif to be placed anywhere, using TDS with masking likelihood and K=8 particles.

python3 main.py experiment=sample_given_motif \
    experiment/motif=5TPN \
    experiment.fixed_motif=False \
    experiment/conditional_method=tds-mask \
    experiment.n_samples=8

Scaffold multi-motif problem 1PRW_two using TDS with frame-based distance likelihood and K=8 particles.

python3 main.py experiment=sample_given_multiple_motifs \
    experiment/multi_motif=1PRW_two \
    experiment/conditional_method=tds-frame-distance \
    experiment.n_samples=8 \
    model=genie-scope-256

Sample a monomer with 250 residues and C-5 internal symmetry using FPS-SMC with K=16 particles.

python3 main.py experiment=sample_given_symmetry \
    model=genie-scope-256 \
    experiment.sample_length=250 \
    experiment.symmetry=C-5 \
    experiment/conditional_method=fpssmc \
    experiment.n_samples=16
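
To make the C-5 constraint concrete, here is a small sketch of what a cyclic point symmetry means geometrically: rotating the structure by 360/5 = 72 degrees about the symmetry axis maps it onto itself. The sketch assumes the 250 residues split evenly into five 50-residue units arranged around the z-axis. That split, the axis choice, the toy coordinates, and the helper names are all illustrative assumptions, not the repository's implementation.

```python
import math

def rotate_z(point, angle):
    """Rotate a 3D point about the z-axis by `angle` radians."""
    x, y, z = point
    c, s = math.cos(angle), math.sin(angle)
    return (c * x - s * y, s * x + c * y, z)

def apply_cyclic_symmetry(subunit, order):
    """Build a C-n arrangement from `order` rotated copies of one subunit."""
    copies = []
    for k in range(order):
        angle = 2 * math.pi * k / order
        copies.extend(rotate_z(p, angle) for p in subunit)
    return copies

# Made-up coordinates for one 50-residue unit (hypothetical data).
subunit = [(10.0 + 0.1 * i, 0.5 * i, 0.2 * i) for i in range(50)]
assembly = apply_cyclic_symmetry(subunit, order=5)

# Rotating by 72 degrees should map unit k onto unit k + 1 (cyclically).
step = 2 * math.pi / 5
is_symmetric = all(
    math.isclose(a, b, abs_tol=1e-9)
    for i, p in enumerate(assembly)
    for a, b in zip(rotate_z(p, step), assembly[(i + 50) % 250])
)
```

Here `len(assembly)` is 250 and `is_symmetric` is `True`, since rotating by one step just cycles the five units.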

Evaluate samples from unconditional, single-motif scaffolding, or multi-motif scaffolding experiments using the in silico design pipeline with CUDA visible devices #0, #2, and #3.

python3 main.py experiment=evaluate_samples \
    experiment.path_to_experiment=<path_to_hydra_output_folder> \
    'experiment.gpu_devices=[0,2,3]'

Each conditional method and model has its own default hyperparameters, which can be overridden on the command line. Check their config files for more information. Custom motifs can also be scaffolded by creating a config file in config/experiment/motif that follows the specification of the configs in that directory. Multiple motifs work the same way, except that their configs are stored in config/experiment/multi_motif.
