Here we provide PUFFIN models trained on peptide-MHC binding affinity datasets from NetMHCpan-3.0 (for class I MHC) and NetMHCIIpan-3.2 (for class II MHC). The five folds are combined and split into training and validation sets.
Note that because we no longer need to hold out data as a test set here, this training setup differs from the one used in the paper (Tables 1 and 2), where the training/test split provided by Bhattacharya et al. was used for class I MHC and the 5-fold cross-validation split from NetMHCIIpan-3.2 was used for class II MHC (the performance on each fold was evaluated by a model trained on the other four folds).
Download the trained model from here.
We provide a Conda environment that includes all the Python packages required by PUFFIN. Build and activate it by:
conda env create -f environment.yml
source activate puffin
To deactivate this environment:
source deactivate
Save all MHC-peptide pairs to be evaluated in a tab-delimited file with three columns: the peptide sequence, the observed binding affinity (use any placeholder number when it is not available), and the MHC allele. The supported MHC allele names are listed in the first column of this file. (class I example, class II example)
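For illustration, a class I input file might contain lines like the ones below (columns are tab-separated; the peptides, affinity values, and allele names here are made up, and the allele strings must match the names listed in the supported-allele file):

SIINFEKL	50.0	HLA-A02:01
GILGFVFTL	25.0	HLA-A02:01
LLFGYPVYV	5000.0	HLA-B07:02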
Then preprocess the data by:
python preprocess.py -i DATAFILE -o OUTDIR -c CLASS
DATAFILE: the file that contains the MHC-peptide pairs
OUTDIR: the directory to save all the output
CLASS: "1" for class I and "2" for class II
Then run the prediction by:
python score.py -o OUTDIR -c CLASS -g GPU
OUTDIR: same as above
CLASS: same as above
GPU: a comma-delimited string that denotes the index(es) of the GPU(s) to run the models on (e.g. "0,1,2,3"). We recommend using multiple GPUs if possible to speed up the prediction.
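Continuing the same hypothetical example, scoring on a single GPU (index 0) would be:

python score.py -o results_class1 -c 1 -g 0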
The predictions are saved in $OUTDIR/PUFFIN.combined. It is a tab-delimited file with four columns: the predicted mean affinity, the epistemic uncertainty, the aleatoric uncertainty, and the binding likelihood at a 500 nM binding threshold.
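As a minimal sketch of how this output might be consumed in Python, the snippet below assumes PUFFIN.combined has no header row, that the four columns appear in the order listed above, and that the hypothetical results_class1 directory from the examples above was used as OUTDIR:

import csv

# Hypothetical path; replace with your own OUTDIR
path = "results_class1/PUFFIN.combined"

with open(path) as f:
    for row in csv.reader(f, delimiter="\t"):
        # Columns as described above: mean affinity, epistemic uncertainty,
        # aleatoric uncertainty, binding likelihood at the 500 nM threshold
        mean_affinity, epistemic, aleatoric, binder_likelihood = map(float, row[:4])
        # Arbitrary illustration: report rows whose predicted binding
        # likelihood exceeds 0.5
        if binder_likelihood > 0.5:
            print(mean_affinity, epistemic, aleatoric, binder_likelihood)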