We present PIFiA (Protein Image-based Functional Annotation), a self-supervised approach for protein functional annotation from single-cell imaging data.
Check out our Molecular Systems Biology article!
We imaged the global yeast ORF-GFP collection and applied PIFiA to generate protein feature profiles from single-cell images of fluorescently tagged proteins. We show that PIFiA outperforms existing approaches for molecular representation learning and describe a range of downstream analysis tasks to explore the information content of the feature profiles.
Linux, macOS, and Windows are supported for running the code on CPU; we recommend Linux for running experiments on GPU. At least 16 GB of RAM is required to run the software. The codebase has been heavily tested on Linux 4.15.0-206-generic #217-Ubuntu.
Our implementation is based on Python 3 and TensorFlow 2.1.
Requirements:
- python 3.7
- tensorflow 2.1
- pandas 0.25.1
- numpy 1.18.1
- matplotlib 3.0.3
- seaborn 0.11.1
- pillow 6.1.0
- plotly 4.14.3
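To compare an existing Python environment against the pins above, a small check like the following may help. It is only a sketch, not part of the PIFiA codebase: `PINNED`, `installed_version` and `find_mismatches` are names introduced here for illustration, and the prefix comparison (so that `2.1` accepts `2.1.0`) is an assumption about how the pins should be read.

```python
# Sketch: compare installed package versions against the pins listed above.
# importlib.metadata needs Python 3.8+; on Python 3.7 the importlib_metadata
# backport offers the same interface (assumption: it is installed).
try:
    from importlib import metadata
except ImportError:  # Python 3.7
    import importlib_metadata as metadata

PINNED = {
    "tensorflow": "2.1",
    "pandas": "0.25.1",
    "numpy": "1.18.1",
    "matplotlib": "3.0.3",
    "seaborn": "0.11.1",
    "pillow": "6.1.0",
    "plotly": "4.14.3",
}


def installed_version(name):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


def find_mismatches(pinned, lookup=installed_version):
    """Map package name -> (wanted, found) for every package that differs."""
    mismatches = {}
    for name, wanted in pinned.items():
        found = lookup(name)
        # A pin such as "2.1" should accept "2.1.0", so compare on the prefix.
        matches = found is not None and (
            found == wanted or found.startswith(wanted + ".")
        )
        if not matches:
            mismatches[name] = (wanted, found)
    return mismatches


for name, (wanted, found) in find_mismatches(PINNED).items():
    print(f"{name}: wanted {wanted}, found {found or 'not installed'}")
```

An empty output means every listed package matches its pin.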
Make sure that you have Anaconda installed. If not, follow the Miniconda installation instructions.
To run the PIFiA code on GPU, make sure that you have a CUDA-capable GPU and that the drivers for your GPU are up to date. In our implementation, we used CUDA 11.0.
Now you can configure the conda environment:
$ git clone https://github.com/arazd/pifia
$ cd pifia
$ conda env create -f environment.yml
Conda should then start downloading and extracting packages. This can take ~15-20 minutes.
To activate the environment, run:
$ conda activate pifia_env
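Once the environment is active, one way to confirm that TensorFlow can see your GPU is the short check below. This is a sketch rather than part of the PIFiA scripts; `visible_gpus` is a helper name introduced here, and it relies on `tf.config.list_physical_devices`, which is available from TensorFlow 2.1 onward.

```python
# Sketch: ask TensorFlow which GPUs it can see. Run inside pifia_env.
def visible_gpus():
    """Return a list of GPU device names, or None if TensorFlow is not importable."""
    try:
        import tensorflow as tf
    except ImportError:
        return None
    return [device.name for device in tf.config.list_physical_devices("GPU")]


gpus = visible_gpus()
if gpus is None:
    print("TensorFlow is not installed in this environment.")
elif gpus:
    print("GPUs visible to TensorFlow:", gpus)
else:
    print("No GPU visible; TensorFlow will fall back to CPU.")
```

If the list is empty on a machine with a GPU, the driver and CUDA setup described above is the first thing to re-check.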
Here we show how to run the PIFiA demo on a toy dataset (5 proteins).
First, unzip the toy dataset folder:
$ cd pifia/data
$ unzip data_subset.zip
A.1 Create folders for checkpointing / saving model weights:
$ cd ../
$ mkdir ckpt_dir
$ mkdir saved_weights
Since our full dataset contains >3 million single-cell images, it is expensive to run feature extraction during training. Hence, we save model weights several times during training, then perform feature extraction and evaluation, and finally select the best weights.
Checkpointing is implemented for training on high-performance computing facilities that require job preemption.
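To illustrate the idea of interval-based checkpointing with resume after preemption, here is a minimal, framework-free sketch. It is not the logic in model/train.py: the function names, the JSON state file and the `run_epoch` callback are assumptions made for illustration, and a real run would also store model and optimizer state. The `interval_s` argument plays the role that `--checkpoint_interval 1800` appears to play in the training command, assuming that value is in seconds.

```python
import json
import os
import tempfile
import time


def save_checkpoint(path, state):
    """Write state via a temp file so a preempted job never leaves a partial file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as handle:
        json.dump(state, handle)
    os.replace(tmp_path, path)  # atomic rename on the same filesystem


def load_checkpoint(path):
    """Return the saved state, or a fresh one if no checkpoint exists yet."""
    if os.path.exists(path):
        with open(path) as handle:
            return json.load(handle)
    return {"epoch": 0}


def train(num_epochs, ckpt_path, interval_s, run_epoch, clock=time.monotonic):
    """Resume from ckpt_path and checkpoint whenever interval_s seconds have passed."""
    state = load_checkpoint(ckpt_path)
    last_saved = clock()
    for epoch in range(state["epoch"], num_epochs):
        run_epoch(epoch)            # placeholder for one epoch of real training
        state["epoch"] = epoch + 1  # next epoch to run after a restart
        if clock() - last_saved >= interval_s:
            save_checkpoint(ckpt_path, state)
            last_saved = clock()
    save_checkpoint(ckpt_path, state)  # always record the final state
    return state
```

When the scheduler restarts a preempted job, calling `train` again with the same `ckpt_path` skips the epochs that were already recorded.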
A.2 Run training script:
$ export HDD_MODELS_DIR=./
$ conda activate pifia_env
$ python model/train.py --dataset harsha \
--backbone pifia_network --learning_rate 0.0003 --dropout_rate 0.02 --cosine_decay True \
--labels_type toy_dataset --dense1_size 128 --num_features 64 --save_prefix TEST_RUN \
--num_epoch 30 --checkpoint_interval 1800 --checkpoint_dir ./ckpt_dir \
--saved_weights_dir ./saved_weights --log_file ./log_file.log
OR, if you are using slurm, run:
$ sbatch scripts/train_pifia.sh
After training is completed, you can find the training log and saved weights in the saved_weights folder we created.
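The training command above passes `--learning_rate 0.0003`, `--cosine_decay True` and `--num_epoch 30`. As a rough picture of what a cosine schedule does to the learning rate, the sketch below uses the standard cosine decay formula. It assumes the rate decays to zero over the full run and is updated once per epoch; the actual schedule in model/train.py may differ (for example, per-step updates or a non-zero floor).

```python
import math


def cosine_decay(initial_lr, step, total_steps):
    """Standard cosine decay from initial_lr at step 0 down to 0 at total_steps."""
    progress = min(step, total_steps) / total_steps
    return 0.5 * initial_lr * (1.0 + math.cos(math.pi * progress))


# Learning rate at the start, middle and end of a 30-epoch run.
for epoch in (0, 15, 30):
    print(epoch, cosine_decay(0.0003, epoch, 30))
```

The rate starts at the initial value, passes through half of it at the midpoint, and approaches zero at the end.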
Loading weights for the PIFiA model is straightforward. The final pre-trained weights of the PIFiA network (used in our paper) are stored under model/pretrained_weights. Alternatively, if you are training the PIFiA network from scratch (as shown in step A), your weights, with the epoch number, should be saved in the saved_weights folder.
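When training from scratch, step A describes saving weights at several epochs, evaluating each, and keeping the best. The selection step can be sketched as below, assuming you already have one evaluation score per saved epoch. `select_best_epoch` is a hypothetical helper and the scores are made-up numbers for illustration, not results from the paper.

```python
def select_best_epoch(scores, higher_is_better=True):
    """Return (epoch, score) for the best-scoring saved weights."""
    if not scores:
        raise ValueError("no evaluated checkpoints")
    pick = max if higher_is_better else min
    # Scanning epochs in ascending order makes ties resolve to the earliest epoch.
    return pick(sorted(scores.items()), key=lambda item: item[1])


# Hypothetical evaluation scores per saved epoch (made-up numbers).
example_scores = {10: 0.61, 20: 0.68, 30: 0.66}
best_epoch, best_score = select_best_epoch(example_scores)
print(best_epoch, best_score)
```

The chosen epoch number then tells you which file in saved_weights to load in the next step.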
B.1 Load a pre-trained PIFiA model
First, activate your conda environment and go to the model folder.
$ conda activate pifia_env
$ python
To load pre-trained PIFiA weights in Python, run the following code in the python shell:
# import packages
>>> import numpy as np
>>> from model import models
>>> from model.extract_features import *
# the number of proteins in the training data sets the number of nodes in the PIFiA classification layer
>>> labels_dict = np.load('data/protein_to_files_dict_toy_dataset.npy', allow_pickle=True)[()]
>>> num_classes = 4049
>>> model = models.pifia_network(num_classes,
...                              k=1,
...                              num_features=64,
...                              dense1_size=128,
...                              last_block=True)
>>> model.load_weights('model/pretrained_weights/pifia_weights_i0')
Note that if you want to load custom PIFiA weights (from training in step A), you need to change the weights path. Loading the model should take ~1 min.
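The `np.load(...)[()]` idiom used when loading `labels_dict` can look odd at first. When `np.save` is given a dictionary, NumPy wraps it in a zero-dimensional object array, and indexing with an empty tuple unwraps it again. The sketch below shows this with a toy dictionary; the keys and file names are made up, and it assumes (from the file name only) that the real dictionary maps protein names to image files. Under that assumption, `len(labels_dict)` is a natural candidate for `num_classes` when loading weights trained on a different protein set.

```python
import os
import tempfile

import numpy as np

# Toy stand-in for the real dictionary; keys and file names are made up.
toy_dict = {"NUP2": ["img_0.png", "img_1.png"], "ACT1": ["img_2.png"]}

path = os.path.join(tempfile.mkdtemp(), "toy_dict.npy")
np.save(path, toy_dict)  # the dict is pickled inside a 0-d object array

loaded = np.load(path, allow_pickle=True)
print(loaded.shape, loaded.dtype)  # a 0-d array with dtype object

recovered = loaded[()]  # empty-tuple index unwraps the dict; loaded.item() is equivalent
num_classes = len(recovered)  # one class per protein, under the assumption above
print(num_classes)
```

Only load `.npy` files with `allow_pickle=True` from sources you trust, since unpickling can execute arbitrary code.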
B.2 Extract single-cell features
After loading the model, here is an example of extracting features for the NUP2 protein from our toy dataset:
>>> protein_name = 'NUP2'
>>> protein_features, protein_images = get_features_from_protein(protein_name, labels_dict, model,
...                                                               average=False, subset='test')
In this example, protein_features is a NumPy array of shape (10, 64), i.e. it contains ten 64-dimensional single-cell feature profiles. The extraction process should take ~1 s on GPU and ~20 s on CPU.
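With `average=False` you get one profile per cell. A single protein-level profile can then be obtained by averaging over cells, as sketched below on random stand-in data of the same shape. This assumes the average feature profile is the plain mean over the cell axis; whether `average=True` in `get_features_from_protein` computes exactly this is not confirmed here.

```python
import numpy as np

# Stand-in for protein_features: 10 cells x 64 features of random numbers.
rng = np.random.default_rng(0)
protein_features = rng.normal(size=(10, 64))

# Average over the cell axis to get one 64-dimensional profile for the protein.
average_profile = protein_features.mean(axis=0)
print(average_profile.shape)
```

Replacing the random array with the real `protein_features` from the step above gives the corresponding profile for NUP2.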
To reproduce the results from the paper, please check out the pifia_results folder.
To interactively look at PIFiA average feature profiles, or aFPs (Fig. 3 in the paper), download the file PIFiA_aFPs_tSNE.html and open it in your web browser. You should be able to see a whole-proteome tSNE with protein names and their annotations (subcellular localization, GO bioprocess, GO molecular function, pathway, protein complex).
If you found this work useful for your research, please cite:
Razdaibiedina, A., Brechalov, A.V., Friesen, H., Mattiazzi Usaj, M., Masinas, M.P.D., Garadi Suresh, H., Wang, K., Boone, C., Ba, J. and Andrews, B.J., 2023. PIFiA: Self-supervised Approach for Protein Functional Annotation from Single-Cell Imaging Data.
@article{razdaibiedina2023pifia,
title={PIFiA: Self-supervised Approach for Protein Functional Annotation from Single-Cell Imaging Data},
author={Razdaibiedina, Anastasia and Brechalov, Alexander V and Friesen, Helena and Mattiazzi Usaj, Mojca and Masinas, Myra Paz David and Garadi Suresh, Harsha and Wang, Kyle and Boone, Charlie and Ba, Jimmy and Andrews, Brenda J},
journal={bioRxiv},
pages={2023--02},
year={2023},
publisher={Cold Spring Harbor Laboratory}
}