Protein design VAE

09.2019

Copyright (C) 2019 Xinran Lian, Andrew Ferguson, Rama Ranganathan

Please refer to instructions in VAE_SH3.ipynb.

Vanilla VAE code for https://www.biorxiv.org/content/10.1101/2022.12.21.521443v1

InfoVAE code: https://github.com/Ferg-Lab/Protein_design_mmdVAE_torch

The protein design variational autoencoder (VAE) is a deep-learning approach for designing new proteins from primary sequence and evolutionary constraints. A neural network is trained on natural sequences, supplied as a multiple sequence alignment (MSA); the VAE encodes the high-dimensional sequence data into a low-dimensional latent space, then decodes points sampled from the latent space to construct new sequences.

This repository includes a complete pipeline, from preprocessing the MSA data to selecting new VAE-generated sequences. The tutorial file VAE_SH3.ipynb is distributed as a Jupyter notebook; for details see https://jupyter.org/.


Inputs/	Input data, including the MSA (.fasta) and the SCA reference (.pdb).
Outputs/	Output files
source/	The VAE code (explained in the appendix)
VAE_SH3.ipynb	The VAE tutorial for SH3
runsca_SH3.sh	The shell script to run SCA on the MSA

Getting Started

1. Clone modules from git

To clone this VAE repository together with the SCA submodule, go to your destination folder and run the following command in the terminal:

git clone https://github.com/andrewlferguson/protein-design-VAE.git

2. Set up environment and dependencies

We recommend setting up the working environment in the following steps:

Install the newest versions of the required Python packages: numpy, pandas, numba, scipy, matplotlib, torch, sklearn (scikit-learn), Bio (Biopython), pySCA:
I. We recommend installing torch according to the official instructions.
II. For the other packages, a Conda environment is helpful. After installing conda, you can install the packages by running the command below in the terminal (sklearn and Bio are import names; their conda packages are scikit-learn and biopython):
```
conda install numpy pandas numba scipy matplotlib scikit-learn biopython
```
Install the pySCA dependencies according to the pySCA installation instructions.
Note: pySCA is a separate module from the VAE. Installing it is highly recommended but not critical. Without its dependencies you cannot pass the -a argument when generating new sequences (see below), and errors will occur when running the SCA-related sections of the tutorial.
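
To see which of the Python packages listed above are still missing, you can probe for them by import name. The sketch below is an illustration only and is not part of this repository: the function name `missing_modules` is hypothetical, and pySCA is left out because its import name is not stated in this README.

```python
from importlib.util import find_spec

# Import names of the packages listed above (assumption: pySCA is omitted
# because this README does not state its import name).
REQUIRED_MODULES = [
    "numpy", "pandas", "numba", "scipy",
    "matplotlib", "torch", "sklearn", "Bio",
]


def missing_modules(names):
    """Return the top-level modules in `names` that cannot be located."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    print("Missing modules:", ", ".join(missing) if missing else "none")
```

An empty result means every listed module can be located in the active environment; it does not check versions.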

3. Usage

For SH3, we provide the input .fasta file (Inputs/sh3_59.fasta) and the reference .pdb structure of Sho1 (2VKN.pdb).
To execute the VAE pipeline, follow the instructions at the beginning of the Jupyter notebook tutorial VAE_SH3.ipynb.
Briefly, run the following commands sequentially in this directory:

cd source  
./preprocessing.py ../Inputs/sh3_59.fasta -n SH3  
./train_model.py -n SH3
./Generate_many_seqs.py -g 1000 -n SH3
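
If you prefer to drive the three steps above from Python, a small wrapper along these lines may help. This is a sketch under assumptions, not a script shipped with the repository: the helper names `pipeline_commands` and `run_pipeline` are hypothetical, and it invokes each script through the current Python interpreter rather than executing it directly.

```python
import subprocess
import sys


def pipeline_commands(fasta, name, ngen=1000):
    """Build the three pipeline commands as argument lists."""
    return [
        [sys.executable, "preprocessing.py", fasta, "-n", name],
        [sys.executable, "train_model.py", "-n", name],
        [sys.executable, "Generate_many_seqs.py", "-g", str(ngen), "-n", name],
    ]


def run_pipeline(fasta, name, ngen=1000, source_dir="source"):
    """Run the steps in order inside `source_dir`, stopping at the first failure."""
    for cmd in pipeline_commands(fasta, name, ngen):
        subprocess.run(cmd, cwd=source_dir, check=True)


# Example call, to be made from the repository root:
# run_pipeline("../Inputs/sh3_59.fasta", "SH3")
```

Because `check=True` raises on a non-zero exit code, training is not started if preprocessing fails.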

Optionally, run the following two commands to evaluate the model with SCA.

./runsca_SH3.sh # run SCA for MSA
./Generate_many_seqs.py -n SH3 -c 1e4 -t 1.0 -p 0 -a # run SCA for generated sequences

Appendix: contents of `source/`

The source code is explained in three categories:

Scripts to be executed
Toolkits
Scripts specific to UChicago RCC
The SCA script for the MSA is not included here because the SCA settings are not generic but depend on the protein family. See the SCA documentation if you are interested.

1. Scripts to be executed

preprocessing.py

Convert the MSA into a one-hot Potts representation.
Compute the plmDCA probability for the MSA.

--help
usage: train_model.py [-h] [-n NAME] [-e NBEPOCH]

optional arguments:
  -h, --help  show this help message and exit
  -n NAME     Name of your protein.
  -e NBEPOCH  number of training epochs.
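
As an illustration of the one-hot representation mentioned for preprocessing.py, the sketch below encodes aligned sequences as 0/1 vectors, one vector per alignment position. It is not the code in preprocessing.py: the function name `one_hot_encode` is hypothetical, and the 21-letter alphabet (gap plus 20 amino acids) and its ordering are assumptions.

```python
# Assumed alphabet: gap symbol followed by the 20 standard amino acids.
ALPHABET = "-ACDEFGHIKLMNPQRSTVWY"


def one_hot_encode(sequences, alphabet=ALPHABET):
    """Encode aligned sequences as nested lists: [sequence][position][state]."""
    index = {symbol: i for i, symbol in enumerate(alphabet)}
    encoded = []
    for seq in sequences:
        rows = []
        for symbol in seq:
            row = [0] * len(alphabet)
            row[index[symbol]] = 1  # raises KeyError for symbols outside the alphabet
            rows.append(row)
        encoded.append(rows)
    return encoded


msa = ["AC-", "CA-"]
encoded = one_hot_encode(msa)
print(len(encoded), len(encoded[0]), len(encoded[0][0]))
```

Each sequence of length L becomes an L x 21 array with exactly one 1 per position, which is the shape a sequence model typically takes as input.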

train_model.py

Train the VAE model.

--help  
optional arguments:
  -h, --help  show this help message and exit
  -n NAME     Name of your protein.
  -e NBEPOCH  number of training epochs.

Generate_many_seqs.py

Generate new sequences.
Compute the closest identity (minimum Hamming distance) of each generated sequence to the MSA, along with the plmDCA probability and the VAE log probability of the generated sequences.
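
The minimum Hamming distance to the MSA can be sketched as follows. This is an illustration, not the implementation in Generate_many_seqs.py: the function names are hypothetical, and normalizing the distance by sequence length is an assumption, inferred from the -t threshold being a float whose default of 1.0 keeps every sequence.

```python
def hamming_fraction(a, b):
    """Fraction of aligned positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same aligned length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def min_hamming_to_msa(seq, msa):
    """Distance from `seq` to its nearest neighbor in the MSA."""
    return min(hamming_fraction(seq, ref) for ref in msa)


def filter_by_distance(generated, msa, thresh=1.0):
    """Keep sequences whose minimum distance does not exceed `thresh`."""
    return [s for s in generated if min_hamming_to_msa(s, msa) <= thresh]


msa = ["ACDE", "ACDF"]
generated = ["ACDE", "AAAA"]
print([min_hamming_to_msa(s, msa) for s in generated])
print(filter_by_distance(generated, msa, thresh=0.5))
```

A distance of 0 means the generated sequence already occurs in the MSA, so low values flag designs that are close copies of natural sequences.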

--help
usage: Generate_many_seqs.py [-h] [-g NGEN] [-s NSAMP] [-t THRESH]
                             [-p THRESHP] [-r RANDSEED] [-n NAME] [-c CUSTOM]
                             [-a]

Hint: In total ngen*nsamp new sequences are generated, default 1000. Then they
are filtered according to thresholds of plmDCA probability and minimum Hamming
distance.

optional arguments:
  -h, --help            show this help message and exit
  -g NGEN, --ngen NGEN  times of sampling in the latent space. Default 1000.
                        Recommended to enter a multiple of 10.
  -s NSAMP, --nsamp NSAMP
                        times of throwing dice at each sampling point. Default
                        10
  -t THRESH, --thresh THRESH
                        Filter out sequences with min Hamming distance larger
                        than the threshold (float). Default 1.0, meaning all 
                        sequences will be kept.
  -p THRESHP, --threshp THRESHP
                        Filter out sequences with plmDCA prob. < threshold
                        (float). Default 110.
  -r RANDSEED, --randseed RANDSEED
                        Random seed. Default 1000.
  -n NAME, --name NAME  Name of your protein.
  -c CUSTOM, --custom CUSTOM
                        A custom string for your generated sequence file name.
                        Default None.
  -a, --sca             Compute SCA for generated sequences
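
One way to read the ngen and nsamp options is that each of the ngen latent points yields a table of per-position amino acid probabilities, from which nsamp sequences are drawn. The sketch below illustrates only that drawing step. It is an assumption about the procedure, not the code in Generate_many_seqs.py: `sample_sequences` is a hypothetical name, the alphabet ordering is assumed, and the probability tables stand in for the decoder output.

```python
import random

# Assumed alphabet: gap symbol followed by the 20 standard amino acids.
ALPHABET = "-ACDEFGHIKLMNPQRSTVWY"


def sample_sequences(position_probs, nsamp, rng, alphabet=ALPHABET):
    """Draw `nsamp` sequences, one symbol per position, from categorical weights."""
    return [
        "".join(rng.choices(alphabet, weights=probs, k=1)[0] for probs in position_probs)
        for _ in range(nsamp)
    ]


def generate(all_position_probs, nsamp, seed=1000):
    """Sample from every latent point; returns ngen * nsamp sequences."""
    rng = random.Random(seed)
    sequences = []
    for position_probs in all_position_probs:
        sequences.extend(sample_sequences(position_probs, nsamp, rng))
    return sequences


# Two stand-in latent points, each with a uniform distribution over 5 positions.
uniform = [[1.0] * len(ALPHABET) for _ in range(5)]
sequences = generate([uniform, uniform], nsamp=10)
print(len(sequences))
```

Fixing the seed makes the draw reproducible, which mirrors the purpose of the -r option.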

2. Toolkits

toolkit.py

Generic tool functions.

compute_plmDCA.jl

A Julia script that computes plmDCA for the MSA. It is executed when preprocessing.py runs.

VAE.pyt

The trained VAE model generated upon running train_model.py.

3. Scripts specified for UChicago RCC

rcc_train_model.sbatch

Run train_model.py on a UChicago RCC GPU node. See the RCC documentation for help.

rcc_train_model.sh

Executed when rcc_train_model.sbatch is run.
