Konstantin Weissenow, Michael Heinzinger, Burkhard Rost
Technical University of Munich
This repository contains the code for the EMBER3D protein structure and mutation effect prediction system. EMBER3D is currently provided as a prototype release for preview purposes. The system is still under active development.
A Google Colab notebook for structure prediction and the rendering of protein mutation movies (PMM) can be found here.
Please install EMBER3D on a Linux machine.
Create a new virtual environment, e.g. using conda:
conda create -n EMBER3D python=3.8
conda activate EMBER3D
If you use CUDA 11, you can use the provided requirements.txt to install dependencies (taking a couple of minutes on a normal desktop computer, depending on internet connectivity):
pip install -r requirements.txt
If you plan to render protein mutation movies, you additionally need PyMOL and ffmpeg, which you can install with e.g.
conda install -c conda-forge ffmpeg
conda install -c conda-forge pymol-open-source
Note for different CUDA versions: We do not yet provide package lists for other CUDA versions. If you use a different version, please use pip or conda to install the following packages:
torch (1.11)
dgl
pyg (aka torch-geometric)
e3nn
psutil
transformers
sentencepiece
biopython
matplotlib
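After a manual installation, a quick check of which packages are importable can save a failed run later. The following is a minimal sketch, not part of the repository: the import names (for example `torch_geometric` for pyg and `Bio` for biopython) are the usual ones for these packages, and the helper `find_missing` is a hypothetical name introduced here. It does not verify versions such as torch 1.11.

```python
import importlib.util

# Import names assumed for the packages listed above
# (torch-geometric is imported as torch_geometric, biopython as Bio).
REQUIRED_MODULES = [
    "torch", "dgl", "torch_geometric", "e3nn", "psutil",
    "transformers", "sentencepiece", "Bio", "matplotlib",
]


def find_missing(modules):
    """Return the module names that cannot be located in the current environment."""
    return [name for name in modules if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = find_missing(REQUIRED_MODULES)
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All listed modules were found.")
```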
You can compute structure predictions based on FASTA sequences using
python predict.py -i <FASTA> -o <OUTPUT_DIRECTORY>
The ProtT5 protein language model used to generate sequence embeddings will be downloaded on first use (2.3 GB) and stored by default in the directory 'ProtT5-XL-U50'. You can change this directory with the --t5_model parameter.
By default, the script will produce PDB files and distance maps. You can disable these outputs using the parameters --no-pdb and --no-distance-maps, respectively.
Predictions for average-length protein sequences take less than a second, but the initial model loading causes a one-time cost of several seconds (depending on system speed). For efficiency, provide a single FASTA file with multiple sequences instead of calling the script multiple times with single-sequence inputs.
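If your sequences are spread over several single-sequence FASTA files, they can be combined into one file before calling the script. The following is a minimal sketch using only the Python standard library; the function name `merge_fasta` and the file names in the usage note are hypothetical and not part of EMBER3D.

```python
from pathlib import Path


def merge_fasta(input_paths, output_path):
    """Concatenate several FASTA files into one multi-sequence FASTA file.

    Returns the number of sequence records (header lines) written.
    """
    n_records = 0
    with open(output_path, "w") as out:
        for path in input_paths:
            for line in Path(path).read_text().splitlines():
                line = line.strip()
                if not line:
                    continue  # skip blank lines between records
                if line.startswith(">"):
                    n_records += 1
                out.write(line + "\n")
    return n_records
```

For example, `merge_fasta(["a.fasta", "b.fasta"], "all.fasta")` writes both records to `all.fasta`, which can then be passed to `predict.py` with `-i`.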
You can predict structures for all single amino-acid variants (SAVs) for the sequence(s) in a FASTA file using
python predict_sav.py -i <FASTA> -o <OUTPUT_DIRECTORY>
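To give a sense of the workload, a sequence of length L has 19 × L single amino-acid variants. The sketch below enumerates them for a toy sequence. It only illustrates the concept and is not taken from `predict_sav.py`; the function name `enumerate_savs` and the label format (wild-type residue, 1-based position, mutant residue) are assumptions chosen for illustration.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def enumerate_savs(sequence):
    """Yield (label, mutant_sequence) for every single amino-acid variant.

    Labels follow the common <wild-type><1-based position><mutant> notation.
    """
    for i, wild_type in enumerate(sequence):
        for mutant in AMINO_ACIDS:
            if mutant == wild_type:
                continue  # substituting a residue with itself is not a variant
            yield f"{wild_type}{i + 1}{mutant}", sequence[:i] + mutant + sequence[i + 1:]


savs = list(enumerate_savs("MKT"))
print(len(savs))  # 3 positions x 19 substitutions = 57
print(savs[0])    # ('M1A', 'AKT')
```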
In addition to PDB files and distance maps, the SAV prediction script computes the structural difference between the prediction for each mutant and the wild-type prediction, measured in lDDT (1.0 = most similar, 0.0 = least similar). These structure deltas are rendered as an image (mutation_matrix.png) and provided as a text file for downstream consumption (mutation_log.txt).
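For intuition about the score, lDDT compares the pairwise distances within a reference structure to the same distances in a second structure and reports the fraction that are preserved. The following is a simplified, C-alpha-only sketch of that idea using the commonly cited 15 Å inclusion radius and 0.5/1/2/4 Å tolerance thresholds. It is not the implementation used by `predict_sav.py`, whose atom selection and averaging may differ; the function name `ca_lddt` and the choice of the wild-type as reference are assumptions.

```python
import math


def ca_lddt(reference, model, cutoff=15.0, thresholds=(0.5, 1.0, 2.0, 4.0)):
    """Simplified C-alpha lDDT between two equally long coordinate lists.

    reference, model: lists of (x, y, z) tuples, one per residue.
    Returns a score in [0, 1]; 1.0 means all local distances are preserved.
    """
    if len(reference) != len(model):
        raise ValueError("coordinate lists must have the same length")
    preserved = 0
    total = 0
    n = len(reference)
    for i in range(n):
        for j in range(i + 1, n):
            d_ref = math.dist(reference[i], reference[j])
            if d_ref >= cutoff:
                continue  # only distances within the inclusion radius count
            diff = abs(d_ref - math.dist(model[i], model[j]))
            for t in thresholds:
                total += 1
                if diff < t:
                    preserved += 1
    return preserved / total if total else 0.0


# Toy example: three residues on a line, third one displaced by 1.5 Å.
wild_type = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0)]
mutant = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (9.1, 0.0, 0.0)]
print(ca_lddt(wild_type, wild_type))  # 1.0
print(round(ca_lddt(wild_type, mutant), 3))  # 0.667
```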
From previously computed SAV predictions (see above), you can render movies using
python render_mutation_movie.py <FASTA> <OUTPUT_DIRECTORY>
Alternatively, you can do both steps (prediction + movie rendering) at once using
./create_SAV_movie.sh <FASTA> <OUTPUT_DIRECTORY>
You can run a simple webserver for the visualization of predictions by starting
python webserver.py
and pointing your browser to http://localhost:24398/ (or to the address of the machine the server is running on). You can change the default port number using the -d parameter when starting the webserver.
We reused several modules of the RoseTTAFold architecture. We use the SE(3)-Transformer implementation from NVIDIA.
For now, please cite this work as follows:
@article {Weissenow2022.11.14.516473,
author = {Weissenow, Konstantin and Heinzinger, Michael and Steinegger, Martin and Rost, Burkhard},
title = {Ultra-fast protein structure prediction to capture effects of sequence variation in mutation movies},
elocation-id = {2022.11.14.516473},
year = {2022},
doi = {10.1101/2022.11.14.516473},
publisher = {Cold Spring Harbor Laboratory},
URL = {https://www.biorxiv.org/content/early/2022/11/16/2022.11.14.516473},
eprint = {https://www.biorxiv.org/content/early/2022/11/16/2022.11.14.516473.full.pdf},
journal = {bioRxiv}
}