This repository contains the data used in the SHEPHARD Google Colab notebooks, along with links to the SHEPHARD tutorial notebooks.
The SHEPHARD code can be found here.
The SHEPHARD documentation can be found here: https://shephard.readthedocs.io/en/latest/
Ginell, Flynn & Holehouse 2022
All data and code used for the analysis in the paper can be found here.
We provide a ready-to-analyze notebook containing the annotated human proteome, which can be used to perform new proteome-wide analyses.
The human proteome is annotated with the following data:
- Post-translational modifications
- Intrinsically disordered regions
- Prion-like domains
- Per-residue secondary structure annotation
- Per-residue solvation scores
- Protein copy number
Once the first three cells are run, the user is free to either run the demo cells or begin novel analysis.
Google-Colab: human_proteome_analysis
Below are links to run each of the SHEPHARD examples in Google Colab.
To start learning how to use SHEPHARD, click one of the Google Colab links below!
NOTE: The analyses in the example notebooks are purely a demonstration of what SHEPHARD is capable of.
Google-Colab Notebook Descriptions: |
---|
read_fasta_map_domains, get_overlaping_domains, get_sequence_around_site, find_sites_near_PTMs, find_lxvp_sites, uniprot_id_to_gene_name, add_callable_attributes, build_track_from_sliding_window |
Google-Colab: read_fasta_map_domains
Functionally, the example script identifies the C- and N-terminal domains of each protein, calculates the serine and glycine content of those terminal domains, and returns the proteins whose C- or N-terminal domains are composed of poly-GS. This example demonstrates how to:
- Read in a FASTA using the `shephard.api.fasta` module
- Add domains to proteins
- Analyze domains
- Assign domain attributes
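The calculation described above can be sketched in plain Python, without the SHEPHARD API. This is a minimal illustration only: the toy sequences, the terminal domain length of 10 residues, and the 0.8 cutoff are all assumptions made for the example and are not taken from the notebook.

```python
def gs_fraction(seq):
    """Fraction of residues in seq that are serine (S) or glycine (G)."""
    if not seq:
        return 0.0
    return sum(1 for aa in seq if aa in "SG") / len(seq)


def terminal_gs_content(sequence, domain_length=10):
    """Return the S/G fraction of the N- and C-terminal domains."""
    n_term = sequence[:domain_length]
    c_term = sequence[-domain_length:]
    return {"N_term": gs_fraction(n_term), "C_term": gs_fraction(c_term)}


# Hypothetical toy proteins; the names and sequences are made up.
proteins = {
    "toy_1": "GSGSGSGSGSMKTAYIAKQRQISFVKSHFSRQ",
    "toy_2": "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEV",
}

threshold = 0.8  # arbitrary cutoff for calling a terminus "poly-GS"
poly_gs_hits = [
    name
    for name, seq in proteins.items()
    if max(terminal_gs_content(seq).values()) >= threshold
]
print(poly_gs_hits)
```

In the notebook itself, the terminal regions are stored as SHEPHARD domains and the S/G content is attached as a domain attribute; the dictionary returned here only stands in for that.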
Google-Colab: get_overlaping_domains
This notebook provides an example of how to evaluate whether domains in a protein overlap, and how to get the fraction of overlap between any two domains. This example demonstrates how to:
- Initialize an empty proteome and add proteins
- Add domains to proteins
- Use built-in domain manipulation functions and functions in `shephard.tools.domain_tools`
- Calculate the fractional overlap between domains
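As a rough illustration of the overlap calculation, the sketch below treats each domain as a `(start, end)` tuple. It does not call `shephard.tools.domain_tools`. Two assumptions are made here: positions are 1-indexed and inclusive, and the fraction is normalized by the shorter domain. The functions in SHEPHARD may normalize differently.

```python
def overlap_length(a, b):
    """Number of positions shared by domains a and b (1-indexed, inclusive)."""
    start = max(a[0], b[0])
    end = min(a[1], b[1])
    return max(0, end - start + 1)


def fractional_overlap(a, b):
    """Shared positions divided by the length of the shorter domain.

    Normalizing by the shorter domain is an assumption of this sketch.
    """
    shorter = min(a[1] - a[0] + 1, b[1] - b[0] + 1)
    return overlap_length(a, b) / shorter


# Hypothetical domain boundaries.
domain_a = (10, 50)
domain_b = (40, 80)
print(overlap_length(domain_a, domain_b))
print(fractional_overlap(domain_a, domain_b))
```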
Google-Colab: build_track_from_sliding_window
This notebook provides an example of how to evaluate a sequence using a sliding window, as well as how to get the region of a track that lies within a domain. This example demonstrates how to:
- Add Tracks based on a custom function
- Calculate the fraction of residues using a sliding window
- Extract the portion of a track that aligns to a specific domain
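The sliding-window idea can be sketched in plain Python as follows. This is not the notebook's code: the window size, the choice to truncate windows at the termini, and the 1-indexed inclusive domain boundaries are assumptions made for illustration.

```python
def sliding_window_fraction(sequence, residues, window=5):
    """Per-residue track: fraction of `residues` in a window centred on each
    position. Windows are truncated at the sequence termini."""
    half = window // 2
    track = []
    for i in range(len(sequence)):
        lo = max(0, i - half)
        hi = min(len(sequence), i + half + 1)
        segment = sequence[lo:hi]
        track.append(sum(aa in residues for aa in segment) / len(segment))
    return track


def track_in_domain(track, start, end):
    """Slice of the track covering a domain (1-indexed, inclusive)."""
    return track[start - 1:end]


# Hypothetical sequence and domain boundaries.
sequence = "AAKKKAA"
lysine_track = sliding_window_fraction(sequence, "K", window=3)
print(track_in_domain(lysine_track, 3, 5))
```

Because the track has one value per residue, extracting the values for a domain reduces to a slice over the domain's boundaries.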
Google-Colab: get_sequence_around_site
This example provides code that takes an input sequence, (1) defines all arginine (R) residues as sites, and (2) gets the local sequence context around those sites. This example demonstrates how to:
- Initialize an empty proteome and add a protein
- Find specific positions of residues
- Add sites to proteins (adding a numerical value to the site)
- Perform site-specific analysis
- Get the local sequence context around a site
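The two steps can be illustrated without SHEPHARD using a short plain-Python sketch. The sequence, the flank size of 3, and the helper names are hypothetical, and positions are assumed to be 1-indexed.

```python
def find_residue_positions(sequence, residue):
    """1-indexed positions at which `residue` occurs in `sequence`."""
    return [i + 1 for i, aa in enumerate(sequence) if aa == residue]


def local_context(sequence, position, flank=3):
    """Residues within `flank` positions of a 1-indexed site, clipped at the
    sequence termini."""
    start = max(1, position - flank)
    end = min(len(sequence), position + flank)
    return sequence[start - 1:end]


# Hypothetical input sequence.
sequence = "MSTRAAGKLRPEV"
arginine_sites = find_residue_positions(sequence, "R")
contexts = {pos: local_context(sequence, pos) for pos in arginine_sites}
print(contexts)
```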
Google-Colab: find_sites_near_PTMs
This notebook reads in all the proteins from the human proteome and annotates them with PTMs. It then calculates the frequency of PTMs near sites of dimethyl-arginine in the human proteome. This example demonstrates how to:
- Initialize an empty proteome and add proteins from a SHEPHARD protein file
- Add sites from a sites file
- Filter a proteome for sites of specific type
- Get sites near other sites
- Add a proteome attribute for quick reference to the analysis performed
- Calculate frequency of PTM sites proximal to a site type
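The proximity count at the core of this analysis can be sketched in plain Python. The site lists, the PTM labels, and the 10-residue window below are made up for illustration; the notebook works on SHEPHARD Site objects read from a sites file instead.

```python
from collections import Counter


def count_ptms_near(sites, target_type, window=10):
    """Count site types found within `window` residues of each site of
    `target_type`. `sites` is a list of (position, type) tuples."""
    counts = Counter()
    for i, (target_pos, kind) in enumerate(sites):
        if kind != target_type:
            continue
        for j, (pos, other_kind) in enumerate(sites):
            # Skip the target site itself, keep everything else in range.
            if i != j and abs(pos - target_pos) <= window:
                counts[other_kind] += 1
    return counts


# Hypothetical annotations: protein name -> list of (position, PTM type).
sites_by_protein = {
    "toy_1": [(10, "Dimethyl-arginine"), (14, "Phosphoserine"),
              (18, "Phosphoserine"), (60, "Acetyllysine")],
    "toy_2": [(5, "Phosphothreonine"), (8, "Dimethyl-arginine")],
}

totals = Counter()
for sites in sites_by_protein.values():
    totals += count_ptms_near(sites, "Dimethyl-arginine")

n = sum(totals.values())
frequencies = {kind: count / n for kind, count in totals.items()} if n else {}
print(frequencies)
```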
Google-Colab: find_lxvp_sites
This notebook reads in all the proteins and intrinsically disordered regions in the human proteome and annotates all examples of 'LXVP' motifs as Sites in the proteome. This example demonstrates how to:
- Read in a UniProt FASTA using the `shephard.api.uniprot` module
- Add domains from a SHEPHARD domains file
- Iterate over domains in proteome
- Find amino acid patterns in domain sequences
- Add and count sites based on identified pattern locations
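The motif search can be illustrated with a regular expression over a domain's sequence. This sketch assumes that X in 'LXVP' means any single residue and that domain boundaries are 1-indexed and inclusive. The sequence and the disordered-region boundaries are hypothetical.

```python
import re

# L, any one residue, then V and P.
LXVP = re.compile(r"L[A-Z]VP")


def find_motif_positions(sequence, domain_start, domain_end, pattern=LXVP):
    """1-indexed positions (in full-protein coordinates) at which the motif
    starts inside the domain."""
    domain_seq = sequence[domain_start - 1:domain_end]
    return [domain_start + m.start() for m in pattern.finditer(domain_seq)]


# Hypothetical protein with a made-up disordered region at residues 10-22.
sequence = "MKTAYLAVPQRSGSLSVPEEKK"
idr_hits = find_motif_positions(sequence, 10, 22)
print(idr_hits, len(idr_hits))
```

Restricting the search to the domain's sequence and then shifting by the domain start keeps the reported positions in the coordinates of the full protein, which is what is needed to register them as sites.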
Google-Colab: uniprot_id_to_gene_name
This notebook provides an example of how to parse the complex protein headers of UniProt FASTA files and extract each protein's associated gene name. This example demonstrates how to:
- Read in a UniProt FASTA using the `shephard.api.uniprot` module
- Parse the UniProt FASTA header when annotated as a protein name
- Add protein attributes
- Write protein attributes using SHEPHARD interfaces
- Read in protein attributes using SHEPHARD interfaces
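The header-parsing step can be sketched on its own. UniProt FASTA headers carry the gene name in a `GN=` field, so a short regular expression is enough. The helper name is hypothetical, and the notebook may parse the header differently.

```python
import re


def gene_name_from_header(header):
    """Return the value of the GN= field in a UniProt FASTA header, or None
    if the header has no gene name."""
    match = re.search(r"\bGN=(\S+)", header)
    return match.group(1) if match else None


header = ("sp|P04637|P53_HUMAN Cellular tumor antigen p53 "
          "OS=Homo sapiens OX=9606 GN=TP53 PE=1 SV=4")
print(gene_name_from_header(header))
```

Returning `None` for headers without a `GN=` field lets the caller decide whether to skip the protein or fall back to another identifier.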
Google-Colab: add_callable_attributes
This notebook provides an example of how to use attributes to save functions and call them later in an analysis.
In this example, a lambda function is saved as a proteome attribute, which allows one to call the attribute with a protein length and identify which percentile that length falls in within the proteome. This example demonstrates how to:
- Read in a UniProt FASTA using the `shephard.api.uniprot` module
- Generate an array comprised of protein lengths in the proteome
- Save a lambda function as a proteome attribute
- Get the value of a parameter at a specific percentile relative to the proteome
- Call a proteome attribute and pass it an input
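The pattern of storing a callable and invoking it later can be shown with a plain dictionary standing in for the proteome's attributes. The lengths are made up, and the percentile is defined here as the share of proteins with a length less than or equal to the query, which is an assumption of this sketch.

```python
from bisect import bisect_right

# Hypothetical protein lengths, sorted so bisect can be used.
lengths = sorted([120, 250, 310, 480, 900])

# A plain dict stands in for the proteome's attribute store.
attributes = {}

# Save a callable: percentage of proteins no longer than the query length.
attributes["length_percentile"] = (
    lambda query: 100 * bisect_right(lengths, query) / len(lengths)
)

# Later in the analysis, fetch the attribute and call it with an input.
print(attributes["length_percentile"](310))
```

Storing the function next to the data it was built from means later cells can ask for a percentile without recomputing or passing around the length array.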
Google-Colab: read_in_all_data
This notebook provides the base SHEPHARD session for exploratory analysis of the human proteome.