This is the official code repository for the paper Kermut: Composite kernel regression for protein variant effects (preprint).
Kermut is a carefully constructed Gaussian process which obtains state-of-the-art performance for protein property prediction on ProteinGym's supervised substitution benchmark while providing meaningful overall calibration.
Kermut has been applied to the supervised ProteinGym substitution benchmark (paper, repo) and reaches state-of-the-art performance across splits.
Below is a table showing the aggregated Spearman scores per cross-validation scheme. In March 2024, the modulo and contiguous splits were updated (see GitHub issue); we show performance on both the corrected splits and the old splits to allow comparison with ProteinNPT.
Model | Average | Random | Modulo | Contiguous |
---|---|---|---|---|
Kermut | 0.655 | 0.744 | 0.631 | 0.591 |
Kermut (old splits) | 0.662 | 0.744 | 0.633 | 0.610 |
ProteinNPT (old splits) | 0.613 | 0.730 | 0.564 | 0.547 |
Below is a table showing the aggregated Spearman scores per functional category.
Model | Activity | Binding | Expression | Organismal Fitness | Stability |
---|---|---|---|---|---|
Kermut | 0.605 | 0.614 | 0.662 | 0.580 | 0.817 |
Kermut (old splits) | 0.606 | 0.630 | 0.672 | 0.581 | 0.824 |
ProteinNPT (old splits) | 0.547 | 0.470 | 0.584 | 0.493 | 0.749 |
The per-variant predictions (with uncertainties) for all assays can be downloaded via:
# Download zip archive
curl -o predictions.zip https://sid.erda.dk/share_redirect/c2EWrbGSCV/predictions.zip
# Unpack and remove zip archive
unzip predictions.zip && rm predictions.zip
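As a quick sanity check of the downloaded files, the predictions can be compared against the assay targets. Below is a minimal sketch; the file path and column names (y, y_pred) are hypothetical and should be confirmed against the unpacked csv headers:

```python
# Sanity-check a single downloaded prediction file (path and column
# names are hypothetical -- inspect the unpacked csvs first).
import pandas as pd
from scipy.stats import spearmanr

df = pd.read_csv("predictions/BLAT_ECOLX_Stiffler_2015_fold_random_5.csv")
rho, _ = spearmanr(df["y"], df["y_pred"])  # rank correlation between targets and predictive means
print(f"Spearman: {rho:.3f}")
```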
The per-variant predictions for the old splits and the ablation predictions can be downloaded via:
# Download zip archive
curl -o predictions_old_split.zip https://sid.erda.dk/share_redirect/c2EWrbGSCV/predictions_old_split.zip
# Unpack and remove zip archive
unzip predictions_old_split.zip && rm predictions_old_split.zip
After cloning the repository, the environment can be installed via
conda env create -f environment.yml
conda activate kermut_env
pip install -e .
The environment.yml file is a minimal version. For exact reproduction, the environment_full.yml file should be used (note, however, that it is system dependent).
To run Kermut from scratch without precomputed resources, e.g., for a new dataset, the ProteinMPNN repository must be installed. Additionally, the ESM-2 650M parameter model must be saved locally:
Kermut leverages structure-conditioned amino acid distributions from ProteinMPNN, which has to be installed from the official repository. An environment variable pointing to the installation location can then be set for later use:
export PROTEINMPNN_DIR=<path-to-ProteinMPNN-installation>
Kermut leverages protein sequence embeddings and zero-shot scores extracted from ESM-2 (paper, repo). Specifically, we use the 650M parameter model (esm2_t33_650M_UR50D). While the ESM repository is installed above (via the yml-file), the model weights should be downloaded separately and placed in the models directory:
curl -o models/esm2_t33_650M_UR50D.pt https://dl.fbaipublicfiles.com/fair-esm/models/esm2_t33_650M_UR50D.pt
curl -o models/esm2_t33_650M_UR50D-contact-regression.pt https://dl.fbaipublicfiles.com/fair-esm/regression/esm2_t33_650M_UR50D-contact-regression.pt
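To verify that the weights are in place, the checkpoint can be loaded through the fair-esm package from the environment. A minimal sketch; fair-esm resolves the companion contact-regression file automatically when it sits next to the main checkpoint:

```python
# Minimal check: load the locally stored checkpoint to confirm placement.
import esm

model, alphabet = esm.pretrained.load_model_and_alphabet("models/esm2_t33_650M_UR50D.pt")
print(model.num_layers, model.embed_dim)  # expect 33 layers, 1280-dim embeddings for the 650M model
```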
This section describes how to access the data that was used to generate the results. To reproduce all results from scratch, follow all steps in this section and in the Data preprocessing section. To reproduce the benchmark results using precomputed resources (ESM-2 embeddings, conditional amino-acid distributions, etc.) see the section on precomputed resources.
Kermut is evaluated on the ProteinGym benchmark (paper, repo). For full details on downloading the relevant data, please see the ProteinGym resources. In the following, commands are provided to extract the relevant data.
- Reference file: A reference file with details on all assays should be saved as data/DMS_substitutions.csv. It can be downloaded from the ProteinGym repo by running the following:
curl -o data/DMS_substitutions.csv https://raw.githubusercontent.com/OATML-Markslab/ProteinGym/main/reference_files/DMS_substitutions.csv
- Assay data: All assays (with CV folds) can be downloaded and extracted to data. This results in two subdirectories: data/substitutions_singles and data/substitutions_multiples. The data can be accessed via CV folds - Substitutions - <Singles,Multiples> in ProteinGym. To download the data, run the following:
# Download zip archive
curl -o cv_folds_singles_substitutions.zip https://marks.hms.harvard.edu/proteingym/cv_folds_singles_substitutions.zip
# Unpack and remove zip archive
unzip cv_folds_singles_substitutions.zip -d data
rm cv_folds_singles_substitutions.zip
- PDBs: All predicted structure files should be downloaded and placed in data/structures/pdbs. The PDBs are accessed via Predicted 3D structures from inverse-folding models in ProteinGym.
# Download zip archive
curl -o ProteinGym_AF2_structures.zip https://marks.hms.harvard.edu/proteingym/ProteinGym_AF2_structures.zip
# Unpack and remove zip archive
unzip ProteinGym_AF2_structures.zip -d data/structures/pdbs
rm ProteinGym_AF2_structures.zip
- Zero-shot scores: For the zero-shot mean function, precomputed scores can be downloaded and placed in data/zero_shot_fitness_predictions, where each zero-shot method has its own directory. The precomputed zero-shot scores from ProteinGym can be accessed via Zero-shot DMS Model scores - Substitutions. NOTE: The full zip archive with all scores takes up approximately 44GB of storage. Alternatively, the zero-shot scores for the 650M parameter ESM-2 model are included in the precomputed resources, which in total take up only approximately 4GB.
# Download zip archive
curl -o zero_shot_substitutions_scores.zip https://marks.hms.harvard.edu/proteingym/zero_shot_substitutions_scores.zip
# Unpack and remove zip archive
unzip zero_shot_substitutions_scores.zip -d data/zero_shot_fitness_predictions
rm zero_shot_substitutions_scores.zip
- (Optional) Baseline scores: Results from ProteinNPT and the baseline models from ProteinGym can be accessed via Supervised DMS Model performance - Substitutions. The resulting csv-file should be placed in results/baselines.
# Download zip archive
curl -o DMS_supervised_substitutions_scores.zip https://marks.hms.harvard.edu/proteingym/DMS_supervised_substitutions_scores.zip
# Unpack and remove zip archive
unzip DMS_supervised_substitutions_scores.zip -d results/baselines
rm DMS_supervised_substitutions_scores.zip
After downloading and extracting the relevant data in the Data access section, ESM-2 embeddings can be generated via the example_scripts/generate_embeddings.sh script or simply via:
python src/data/extract_esm2_embeddings.py \
--dataset=all \
--which=singles
The embeddings for individual assays can be generated by replacing all with the full assay name (e.g., --dataset=BLAT_ECOLX_Stiffler_2015).
For each assay, an h5 file is generated which contains the embeddings for all variants. Since the Kermut GP only uses the mean-pooled embeddings, only these are stored. To obtain per-amino-acid embeddings, the extraction script can be altered by removing the mean operator.
The embeddings are located in data/embeddings/substitutions_singles/ESM2 (for the single-mutant assays). For the multi-mutant assays, replace singles with multiples.
The structure-conditioned amino acid distributions for all residues and assays can be computed with ProteinMPNN via
bash example_scripts/conditional_probabilities.sh
For a single dataset, see example_scripts/conditional_probabilities_single.sh or example_scripts/conditional_probabilities_all.sh. This generates per-assay directories in data/conditional_probs/raw_ProteinMPNN_outputs. After this, postprocessing for easier access is performed via
python src/data/extract_ProteinMPNN_probs.py
This generates per-assay npy-files in data/conditional_probs/ProteinMPNN.
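The resulting arrays can be inspected with numpy. The filename and the expected layout (one categorical distribution over the 20 amino acids per residue) are assumptions in the sketch below:

```python
# Inspect a post-processed ProteinMPNN output (filename and array
# layout are assumptions).
import numpy as np

probs = np.load("data/conditional_probs/ProteinMPNN/BLAT_ECOLX_Stiffler_2015.npy")
print(probs.shape)     # expected (L, 20): one distribution per residue
print(probs[0].sum())  # should be ~1.0 if stored as probabilities
```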
Lastly, the 3D coordinates can be extracted from each PDB file via
python src/data/extract_3d_coords.py
This saves npy-files for each assay in data/structures/coords.
If not relying on pre-computed zero-shot scores from ProteinGym, they can be computed for ESM-2 via:
# For all datasets:
python src/data/extract_esm2_zero_shots.py --dataset all
# For a single dataset:
python src/data/extract_esm2_zero_shots.py --dataset name_of_dataset
See the script for usage details. For multi-mutant datasets, the log-likelihood ratios are summed for each mutant.
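For intuition, the summation amounts to scoring each substitution by its log-likelihood ratio and adding the contributions. A minimal sketch, assuming per-position log-probabilities have already been extracted from the model (all names below are hypothetical):

```python
# Illustrative only: score a (multi-)mutant as the sum of per-site
# log-likelihood ratios, log p(mut) - log p(wt), under the model.
def zero_shot_score(mutations, log_probs):
    """mutations: list of (position, wt_aa, mut_aa); log_probs[pos][aa] is hypothetical."""
    return sum(log_probs[pos][mut] - log_probs[pos][wt] for pos, wt, mut in mutations)

# Toy double mutant A1V:K3R with made-up log-probabilities
log_probs = {1: {"A": -0.5, "V": -1.2}, 3: {"K": -0.8, "R": -0.9}}
print(zero_shot_score([(1, "A", "V"), (3, "K", "R")], log_probs))  # -0.8
```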
All outputs from the preprocessing procedure (i.e., precomputed ESM-2 embeddings, conditional amino acid distributions, processed coordinate files, and zero-shot scores from ESM-2) can be readily accessed via a zip archive hosted on the Electronic Research Data Archive (ERDA) at the University of Copenhagen using the following link. The archive takes up approximately 4GB. To download and extract the data, run the following:
# Download zip archive
curl -o kermut_data.zip https://sid.erda.dk/share_redirect/c2EWrbGSCV/kermut_data.zip
# Unpack and remove zip archive
unzip kermut_data.zip && rm kermut_data.zip
To reproduce the results (assuming that preprocessing has been carried out for all 217 DMS assays), run the following script:
bash example_scripts/benchmark_all_datasets.sh
This will run the Kermut GP on all 217 assays using the predefined random, modulo, and contiguous splits for cross-validation. Each assay and split configuration generates a csv-file with the per-variant predictive means and variances. The prediction files can be merged using
python src/process_results/merge_score_files.py
This script generates two files: kermut_scores.csv and merged_scores.csv. The former contains all of Kermut's processed results, while the latter additionally includes all the reference models from ProteinGym.
Lastly, the results are aggregated across proteins and functional categories to obtain the final scores using ProteinGym functionality:
python src/process_results/performance_DMS_supervised_benchmarks.py \
--input_scoring_file results/merged_scores.csv \
--output_performance_file_folder results/summary
The post-processing steps for the main and ablation results can be seen in example_scripts/process_results.sh.
Assuming that preprocessed data from ProteinGym is used, Kermut can be evaluated on any assay individually as follows:
python src/experiments/proteingym_benchmark.py --multirun \
dataset=ASSAY_NAME \
split_method=fold_random_5,fold_modulo_5,fold_contiguous_5 \
gp=kermut \
use_gpu=true
The codebase was developed using Hydra. The ablation GPs/kernels can be accessed straightforwardly:
python src/experiments/proteingym_benchmark.py --multirun \
dataset=ASSAY_NAME \
split_method=fold_random_5,fold_modulo_5,fold_contiguous_5 \
gp=kermut_constant_mean,kermut_no_g
Any dataset whose structure follows the ProteinGym format (and has been processed as above) can be evaluated using new splits. This simply requires adding a new column to the assay file in data/substitutions_singles. The column should contain an integer value for each variant indicating which fold it belongs to (a sketch for adding such a column is shown after the command below). The chosen GP will then be evaluated using CV (i.e., tested on each unique partition while trained on the remaining):
python src/experiments/proteingym_benchmark.py --multirun \
dataset=ASSAY_NAME \
split_method=new_split_col \
gp=kermut \
use_gpu=true
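For instance, a custom fold column can be added with pandas. The sketch below assigns a random 5-fold partition; the assay filename is hypothetical, and the column name must match the split_method argument:

```python
# Add a custom CV fold column to an assay file (random 5-fold assignment
# as an example; any integer partition works).
import numpy as np
import pandas as pd

path = "data/substitutions_singles/ASSAY_NAME.csv"  # hypothetical filename
df = pd.read_csv(path)
rng = np.random.default_rng(seed=42)
df["new_split_col"] = rng.integers(0, 5, size=len(df))  # fold ids 0-4
df.to_csv(path, index=False)
```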