This code provides implementations of variational autoencoder models designed to work with aligned and unaligned protein sequence data, as described in the manuscript "Generating novel protein variants with variational autoencoders".
The code requires Python 3. Variational autoencoder models were implemented in Keras (2.1.2) using the TensorFlow backend (TensorFlow 1.0.0). The full list of Python dependencies is in requirements.txt.
Individual models were trained on a single Tesla K80 GPU with CUDA 8.0.0, cuDNN v5 and Python 3.6.0.
To run the code locally, first clone the repository, then install all dependencies:
pip install -r requirements.txt
To train a model, run the corresponding script (training logs are written to output/logs, and weights are saved to output/weights at the end of training):
python scripts/train_msa.py
or
python scripts/train_raw.py
We recommend a GPU for the latter (train_raw.py); the former (train_msa.py) can run in a few hours on a standard CPU.
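Because weights are only saved to output/weights at the end of training, it can be convenient to look up the newest weights file before passing it to another script. The following is a minimal sketch, not part of the repository: latest_weights is a hypothetical helper, and it assumes the weights files use the .h5 extension seen in the example path in this README.

```python
from pathlib import Path
from typing import Optional


def latest_weights(weights_dir: str = "output/weights") -> Optional[Path]:
    """Return the most recently modified .h5 file in weights_dir, or None.

    Assumption: weights are stored as .h5 files directly inside weights_dir.
    """
    directory = Path(weights_dir)
    if not directory.is_dir():
        # Nothing has been trained yet, or the path is wrong.
        return None
    candidates = list(directory.glob("*.h5"))
    if not candidates:
        return None
    # Pick the file with the newest modification time.
    return max(candidates, key=lambda path: path.stat().st_mtime)


if __name__ == "__main__":
    print(latest_weights())
```

The function returns None rather than raising when the directory is missing or empty, so a calling script can decide whether to train first or stop with its own message.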
To generate sequences by sampling from the prior, run scripts/generate_from_prior.py, passing the name of the weights file and specifying the --unaligned flag if using an ARVAE model. Generated sequences will be written to a new FASTA file in output/generated_sequences/. For example:
python scripts/generate_from_prior.py data/weights/msavae.h5
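Once sequences have been generated, the FASTA file can be read back for inspection. The sketch below is a minimal, dependency-free reader and is not part of the repository: parse_fasta is a hypothetical helper, and the record names in the example are made up for illustration.

```python
from typing import Iterable, List, Tuple


def parse_fasta(lines: Iterable[str]) -> List[Tuple[str, str]]:
    """Return (header, sequence) pairs from FASTA-formatted lines.

    Sequences that span several lines are joined; blank lines are skipped.
    """
    records = []
    header = None
    chunks = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Illustrative input only; real files come from output/generated_sequences/.
example = [">seq_0", "MKT", "AYI", ">seq_1", "MSLQ"]
for name, sequence in parse_fasta(example):
    print(name, len(sequence), sequence)
```

To read a real file, open it and pass the file handle to parse_fasta, since iterating over a handle yields its lines. The name of the generated file is not specified here, so check output/generated_sequences/ for the file the script wrote.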