We used the ProteinCartography (v0.4.2) pipeline and the Actin Prediction pipeline to investigate the well-known actin family of proteins.
You can find a summary of this study and the results in the Pub:
The results of the analysis performed in this repository can be found on Zenodo:
The ProteinCartography pipeline is designed to explore protein families based on their structural relationships. The pipeline produces interactive maps with clusters and various overlays that can be used to analyze families. For more information on the ProteinCartography pipeline, check out the pub.
As a use case of this tool, we analyzed the actin family. This well-studied family has multiple subfamilies, which makes it an interesting choice for the pipeline. Additionally, actins are involved in many cellular functions, so the results are relevant to many fields. We've previously investigated this family in our Defining Actin pub and the corresponding GitHub repo. In that analysis, we gathered a list of actins via a BLAST search for the top 50,000 proteins in the NCBI non-redundant (nr) database with no taxonomic restrictions. This list is used in the current analysis.
For in-depth instructions on how to use this repository in conjunction with the ProteinCartography pipeline and the Actin Prediction pipeline, see the Walkthrough below. Briefly, you should follow these steps:
- Clone this repository, set up the 2023-actin-embedding environment
- Clone the ProteinCartography pipeline, set up the cartography_tidy environment
- Set up the general directory structure
- Fetch data from the Actin Prediction repo
- Use the 1_prepare_metadata.ipynb notebook to prepare metadata for the ProteinCartography analysis
- Use the 2_get_alphafold_structures.ipynb notebook to download all relevant AlphaFold structures
- Run ProteinCartography in "Cluster Mode"
- Create custom plots using the 3_plotting_overlays.ipynb notebook
- Evaluate cluster distributions using the 4_cluster_distributions.ipynb notebook
First, you should clone this repository using the following command:
git clone https://github.com/Arcadia-Science/2023-actin-embedding.git
For this repository, we use the 2023-actin-embedding environment, which is the cartography_tidy environment from the ProteinCartography pipeline plus a few additional packages, including scipy and ipykernel. We recommend using conda and/or mamba to set up your environment. Create the environment using conda by running the following commands from within the repository:
conda env create -f envs/2023-actin-embedding.yml -n 2023-actin-embedding
conda activate 2023-actin-embedding
To begin, visit the ProteinCartography repo for a guide on running the pipeline. The notebooks used in this repo are designed to run alongside the ProteinCartography pipeline, so we recommend cloning the ProteinCartography pipeline before beginning. We used the ProteinCartography pub release (v0.4.2) for this analysis.
The Quickstart guide in the ProteinCartography repository should get you started. But briefly, to clone the repository, use the following command:
git clone https://github.com/Arcadia-Science/ProteinCartography.git
You can then check out the specific v0.4.2 release by running the following from within the cloned ProteinCartography folder:
git checkout v0.4.2
We will use the cartography_tidy environment from the ProteinCartography pipeline to run the ProteinCartography analysis. Create the environment using conda by running the following commands from within the ProteinCartography repository:
conda env create -f envs/cartography_tidy.yml -n cartography_tidy
conda activate cartography_tidy
After cloning this repository and the ProteinCartography repository, you should set up your directory structure as follows:
├── ProteinCartography
│   └── actin              # preparatory files and ProteinCartography run results end up here
│       ├── structures     # structures downloaded from the AlphaFold database
│       └── output         # ProteinCartography results end up here
├── 2023-actin-embedding
│   ├── notebooks          # notebooks used for analysis
│   ├── input              # files from actin prediction pipeline
│   └── output             # final figures will end up here
Then move into the 2023-actin-embedding folder for the remainder of the analysis.
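If you prefer to create the empty folders programmatically, the layout above can be sketched in Python as follows. This is a minimal sketch, not part of the repository: the function name `make_layout` and the `root` argument are hypothetical, and the folder names are copied from the tree above. Folders that already exist (for example, from cloning the two repositories) are left untouched.

```python
from pathlib import Path

# Folder names are copied from the directory tree in this README.
LAYOUT = [
    "ProteinCartography/actin/structures",
    "ProteinCartography/actin/output",
    "2023-actin-embedding/notebooks",
    "2023-actin-embedding/input",
    "2023-actin-embedding/output",
]


def make_layout(root):
    """Create any missing folders under `root` and return their paths.

    `root` is a hypothetical parent directory that holds both clones.
    """
    created = []
    for relative in LAYOUT:
        folder = Path(root) / relative
        # exist_ok=True means existing clones and folders are not modified
        folder.mkdir(parents=True, exist_ok=True)
        created.append(folder)
    return created


# Example usage (uncomment to run from the directory that holds both clones):
# make_layout(".")
```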
Once ProteinCartography is installed, you can find the data generated in the Actin Prediction pipeline on Zenodo. This data is also included in the Inputs folder of this repository as all_outputs_summarized.tsv. The proteins from this file were used as our input for the analysis in this repository. Because the list includes many proteins, it was split into two batches: 2022-actin-prediction-blastoutputs1.txt and 2022-actin-prediction-blastoutputs2.txt.
The notebooks in this repository were created to help prepare the metadata, download AlphaFold structures, and apply additional custom overlays to the output map.
We started this analysis with a list of the 50,000 proteins most related to human actin according to protein BLAST. Before the ProteinCartography pipeline could be used, we prepared the list of proteins by running the 1_prep_metadata.ipynb notebook from within the 2023-actin-embedding/notebooks folder using the 2023-actin-embedding environment. To activate this environment, use the shell command:
conda activate 2023-actin-embedding
The notebook maps RefSeq IDs to UniProt accession IDs, then retrieves data from UniProt for each protein. The data retrieved from UniProt include protein name, organism, taxonomic information, length, annotation score, fragment status, sequence, and gene name. Fragmentary proteins are then filtered out.
The final list of proteins is reformatted to the format required for "Cluster mode" of the ProteinCartography pipeline.
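To illustrate the fragment-filtering step, here is a minimal sketch in Python. It is not the notebook's code. The column names (`Entry`, `Fragment`), the convention that fragmentary entries carry a non-empty value in the fragment column, and the placeholder accessions are all assumptions made for this example; check the notebook for the fields it actually uses.

```python
import csv
import io


def drop_fragments(rows, fragment_field="Fragment"):
    """Keep only rows whose fragment field is empty (assumed to mean full-length).

    `fragment_field` is a hypothetical column name.
    """
    return [row for row in rows if not (row.get(fragment_field) or "").strip()]


# Tiny made-up table standing in for the UniProt results.
# The accessions are placeholders, not real entries.
EXAMPLE_TSV = (
    "Entry\tProtein names\tLength\tFragment\n"
    "ACC0001\tActin\t375\t\n"
    "ACC0002\tActin\t120\tfragment\n"
)

rows = list(csv.DictReader(io.StringIO(EXAMPLE_TSV), delimiter="\t"))
full_length = drop_fragments(rows)
print([row["Entry"] for row in full_length])
```

In this example, only the first placeholder entry is kept because its fragment column is empty.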
Next, we downloaded all available AlphaFold structures by running the 2_get_alphafold_structures.ipynb notebook. This notebook should be run from within the 2023-actin-embedding/notebooks folder using the 2023-actin-embedding environment, but will deposit structures into ProteinCartography/actin/structures if you've set up your directory structure as above.
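For readers who want to see what a structure download involves, here is a minimal sketch in Python. It is not the notebook's code. It assumes the AlphaFold database's public file URL pattern (`AF-<accession>-F1-model_v<version>.pdb`) and model version 4, both of which may have changed since this analysis; the function names and the `<accession>.pdb` output filename are hypothetical.

```python
from pathlib import Path
from urllib.error import HTTPError
from urllib.request import urlretrieve


def alphafold_url(accession, version=4):
    """Build the download URL for one UniProt accession.

    The URL pattern and default version are assumptions; verify them
    against the AlphaFold database before relying on this.
    """
    return f"https://alphafold.ebi.ac.uk/files/AF-{accession}-F1-model_v{version}.pdb"


def download_structure(accession, out_dir, version=4):
    """Download one structure; return the saved path, or None on an HTTP error."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    target = out_dir / f"{accession}.pdb"  # hypothetical naming scheme
    if target.exists():
        return target  # skip files that were already downloaded
    try:
        urlretrieve(alphafold_url(accession, version), str(target))
    except HTTPError:
        # e.g. the accession has no predicted structure in the database
        return None
    return target


# Example usage (requires network access), with the paths from the tree above:
# download_structure("P60709", "../../ProteinCartography/actin/structures")
```

Returning `None` on an HTTP error lets a loop over many accessions continue when some proteins have no predicted structure.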
We then ran ProteinCartography in "Cluster Mode" using the standard pipeline parameters. The complete analysis is available on Zenodo.
We placed the config_ff_actin.yml file, which can be found in the ProteinCartography_docs folder of this repository, inside the ProteinCartography/actin folder. We moved into the ProteinCartography directory and then used the cartography_tidy environment to run the analysis. To activate the environment, use the following command:
conda activate cartography_tidy
Then, from within the ProteinCartography folder, we ran the pipeline in "Cluster Mode" with the following command:
snakemake --snakefile Snakefile_ff --configfile actin/config_ff_actin.yml --use-conda --cores 2
We used the results of ProteinCartography and the results from the Actin Prediction pipeline to create custom plots. To do this, we moved back into the 2023-actin-embedding/notebooks folder and ran the 3_plotting_overlays.ipynb notebook using the 2023-actin-embedding environment. To activate this environment, run the following command:
conda activate 2023-actin-embedding
Finally, we evaluated the distributions of proteins within clusters by running the 4_cluster_distributions.ipynb notebook.
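As a rough illustration of what a cluster distribution looks like, the sketch below computes, for each cluster, the fraction of its proteins that fall into each category. It is not the notebook's code, and the notebook may calculate something different. The cluster labels (`LC00`, `LC01`), the category names, and the function name are made up for this example.

```python
from collections import Counter, defaultdict


def cluster_distribution(assignments):
    """Map each cluster to the fraction of its proteins in each category.

    `assignments` is an iterable of (cluster, category) pairs, one per protein.
    """
    counts = defaultdict(Counter)
    for cluster, category in assignments:
        counts[cluster][category] += 1
    return {
        cluster: {cat: n / sum(cats.values()) for cat, n in cats.items()}
        for cluster, cats in counts.items()
    }


# Made-up example data: four proteins in two clusters.
example = [
    ("LC00", "actin"),
    ("LC00", "actin"),
    ("LC00", "actin-like"),
    ("LC01", "actin-like"),
]
fractions = cluster_distribution(example)
print(fractions)
```

Normalizing within each cluster makes clusters of very different sizes directly comparable.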