This directory contains a computational pipeline for analyzing data from the high-throughput protein-stability assay from Rocklin et al, 2017, Science. Much of the code that makes up the pipeline is derived from the Rocklin et al. paper. This pipeline involves three main steps:
- Computing EC50 values from FACS and deep-sequencing data using the script
scripts/compute_ec50_values_from_deep_sequencing_data.py
. - Computing stability scores from EC50 values using the script
scripts/compute_stability_scores_from_EC50_values.py
- Making summary plots using the script
scripts/create_summary_plots.py
.
Below are detailed instructions for conducting each step. Note: I've had to separate these into different steps since steps 1 and 2 require slightly different external dependencies (see the next section). In the future, my goal is to find a single set of external dependencies that works for everything, but this has been challenging.
Carrying out the pipeline requires external dependencies. As mentioned above, some steps require different external dependencies. For each step, I have encoded nearly all dependencies in Conda
environments, which can be installed using the following YML files:
-
environment_compute_ec50_values.yml
: An environment for carrying out steps 1 and 3 of the pipeline. -
environment_compute_stability_scores.yml
: An environment for carrying out step 2 of the pipeline.
as described here using the command:
conda env create -f {environment.yml}
where environment.yml
is the name of the YML file. To run each step in the pipeline, the appropriate environment must first be activated. To do so, simply use the command:
source activate {env_name}
where env_name
is the name of the environment as encoded in the YML file (e.g., 2018_prot_stab_compute_stability_scores
).
In addition to the dependencies encoded in the Conda
environments, this pipeline also requires a program called PEAR
to assemble paired-end deep-sequencing reads. Instructions for installing PEAR
are provided on its website.
The pipeline is in the form of a series of Python
scripts that are contained in the directory called scripts/
. Below, I describe how to carry out each of the three steps of the pipeline summarized above.
Step #1: Compute EC50 values from FACS and deep-sequencing data using the script compute_ec50_values_from_deep_sequencing_data.py
First, install the dependencies in the Conda
environment encoded in environment_compute_ec50_values.yml
(see above). Then activate it using the command:
source activate 2018_prot_stab_compute_ec50_values
(2018_prot_stab_compute_ec50_values
is the default name of the environment in the corresponding YML file).
Next, call the script compute_ec50_values_from_deep_sequencing_data.py
with the appropriate command-line arguments, as described below. All arguments are required:
python compute_ec50_values_from_deep_sequencing_data.py [--designed_sequences_file DESIGNED_SEQUENCES_FILE] [--experimental_summary_file EXPERIMENTAL_SUMMARY_FILE] [--fastq_dir FASTQ_DIR] [--pear_path PEAR_PATH] [--five_prime_flanking_seq FIVE_PRIME_FLANKING_SEQ] [--three_prime_flanking_seq THREE_PRIME_FLANKING_SEQ] [--output_dir OUTPUT_DIR]
-
--designed_sequences_file
: the path to a CSV file giving the name and DNA or protein sequence for each input design. See here for an example. In this file, each row specifies a design and each column specifies information about that design. This file must have the following comma-delimited columns:name
: a unique name for the designprotein_sequence
ordna_sequence
: the protein or DNA sequence of the design, respectively
-
--fastq_dir
: a path to a directory that has the deep-sequencing data in the form of unassembled paired-end FASTQ files for all samples in the experiment. -
--experimental_summary_file
: the path to a CSV file with a variety of metadata for each sample in the experiment, where samples are rows and columns are different pieces of metadata. See here for an example. This file is based on theexperiments.csv
file from the Rocklin et al. study, but contains come additional columns, and is also missing certain columns that will be added in later by the analysis code. The columns must be comma-delimited and must include:experiment_id
: a unique ID for the entire experiment (need to decide if I want to use this somehow)protease_type
: the protease associated with the sample (e.g., "trypsin" or "chymotrypsin")selection_strength
: the index of the selection step associated with the sample, following the indexing scheme used in the Rocklin et al. study. The standard range of these values is normally: 0-6, where "0" corresponds to the naive library that has not been exposed to protease and "6" corresponds to the library that has been challenged with the highest concentration of protease. In theexperiments.csv
file from Rocklin et al., the naive library has a blank entry in this column, instead of "0", as described above. You must include this "0" initially because it is used to identify which data (e.g., deep-sequencing data) corresponds to the naive library. Later, the pipeline creates a newexperiments.csv
file in theec50/
folder that gets created inoutput_dir
. This new file is what is used as input for computing EC50 values and will have a blank entry instead of "0".conc_factor
: the fold-change in protease concentration between selection steps, i.e., whenselection_strength
is incrimented by a value of one. A value of 3 would indicate that the protease concentration is increased by 3 fold between selection steps. Note: leave this value blank for samples that have not been treated with any protease.parent
: the value ofselection_strength
for the sample that serves as the input library for the given selection. A value of 0 would indicate that thefastq_id
: a string that is common to all FASTQ files for a given sample and is unique to those files. The code will search within the directory specified by--fastq_dir
for all FASTQ files that contain this string in their name, aggregating the data among all matches. Thus, strings that are not unique to a particular dataset will result in "cross contamination" between datasets. Be careful when using numbers. For instance, the string "test1" would find not only FASTQ files with "test1" in their name, but also FASTQ files with "test10" in their name. Using a string like "test1_" could be used to get around this problem.parent_expression
: the fraction of cells (events) passing the selection threshold in the given library before proteolysis, according to the sorting instrument.fraction_collected
: the fraction of cells (events) passing the selection threshold in the given library after proteolysis, according to the sorting instrumentcells_collected
: the total number of cells (events) collected during the given selection, according to the sorting instrument
-
--pare_path
: a path to the programPEAR
-
--five_prime_flanking_seq
: a DNA sequence that flanks the coding sequence of interest on the 5' end of the sequencing read (string). The coding sequence should begin immediately after the final nucleotide of this flanking sequence. This flanking sequence and the one given bythree_prime_flanking_seq
will be used to extract the DNA coding sequence from each sequencing read and then translate it into a protein dsequence. Note: the default sequence used in Rocklin et al., 2017, Science wasCATATG
. -
--three_prime_flanking_seq
: a DNA sequence that flanks the coding sequence of interest on the 3' end of the sequencing read (string). The coding sequence should begin immediately before the first nucleotide of this flanking sequence. Note: this DNA sequence should be in the same 5'-to-3' orientation asfive_prime_flanking_seq
. Note: the default sequence used in Rocklin et al., 2017, Science wasCTCGAG
-
--output_dir
: a path to an output directory where all the results will be stored. This directory will be created if it does not already exist -
--protein_or_dna_level
: whetherdesigned_sequences_file
contains DNA or protein sequences. Options are: 'protein' or 'dna'. Default is: 'protein'.
The output of this step is described in the below section, and includes all output except for the output data in the folder called stability_scores
.
Step #2: Compute stability scores from EC50 values using the script compute_stability_scores_from_EC50_values.py
First, install the dependencies in the Conda
environment encoded in environment_compute_stability_scores.yml
(see above). Then activate it using the command:
source activate 2018_prot_stab_compute_stability_scores
(2018_prot_stab_compute_stability_scores
is the default name of the environment in the corresponding YML file).
Next, call the script compute_stability_scores_from_EC50_values.py
using the following command-line arguments:
First, activate the same Conda
environment used in step 1.
Next, call the script create_summary_plots.py
using the following command-line argument:
python scripts/create_summary_plots.py [--data_dir DATA_DIR] [--output_dir OUTPUT_DIR]
--data_dir
: the directory that contains all output data from steps 1 and 2 (specified above by--output_dir
)--output_dir
: the output directory where the plots will be stored
All results are stored in the directory specified by the input command --output_dir
. Within this directory, there are multiple subdirectories that contain the results:
-
log.log
: a log file (currently, this file is not actually generated; I still need to implement this) -
paired_FASTQ_files/
: this directory contains assembled FASTQ files and associated log files generated byPARE
:- for each sample and for each pair of unassembled FASTQ files associated with that sample, a FASTQ of the assembled paired-end reads. The assembled FASTQ files are named according to the format:
{protease_type}_{selection_strength}-{n}.fastq
, whereprotease_type
andselection_strength
are derived from the corresponding entries in the input file specified by the command--experimental_summary_file
, and wheren
corresponds to the index of the assembled pair of FASTQ files, indexed starting at one. This index is necessary since some samples may be associated with more than one pair of FASTQ files.
- for each sample and for each pair of unassembled FASTQ files associated with that sample, a FASTQ of the assembled paired-end reads. The assembled FASTQ files are named according to the format:
-
counts/
: this directory contains counts files, including:- for each sample, a file giving counts of all unique sequences observed in the deep-sequencing data. These files are named according to the format:
{protease_type}_{selection_strength}_counts.csv
. They contain either protein or DNA counts depending on the value of the--protein_or_dna_level
input argument. - for each protease, a file giving the counts aggregated across all samples that are associated with that protease. This file only includes counts for sequences that match one of the input designs included in the file specified by the input command
--designed_sequences_file
and have more than one counts in the naive library (selection index = 0). These files are named according to the format:{protease_type}.counts
. These files are similar to the.fulloutput
files from the Rocklin et al. study. They contain either protein or DNA counts depending on the value of the--protein_or_dna_level
input argument.
- for each sample, a file giving counts of all unique sequences observed in the deep-sequencing data. These files are named according to the format:
-
ec50_values/
: this directory contains a variety of output files, including:experiments.csv
: a file that serves as input into the script for computing EC50 values. This file is identical to theexperiments.csv
file described in the Rocklin et al. study.- for each protease, a file giving the aggregated counts across all samples that are associated with that protease. These files are the same as the ones in
counts/
, copied over for the purpose of inferring EC50s (the current script for inferring EC50s requires that all input is in the same directory; this should be changed in the future). - for each protease, a file called
{protease}.fulloutput
giving the EC50 values and other metadata in tab-delimited columns. These files are similar to the.fulloutput
files from the Rocklin et al. study. See here for an example.
-
stability_scores/
: this directory contains a variety of output files, including:- for each protease, a file called
{protease}_stability_scores.txt
, which is the same as the corresponding{protease}.fulloutput
file in theec50_values/
subdirectory, but with the following additional columns:ec50_pred
: a sequence's unfolded EC50, as predicted by the unfolded-state model from Rocklin et al.ec50_rise
: the difference between a sequence's observed EC50 value and the predicted EC50 value of the unfolded sequence (= observed - predicted unfolded)stabilityscore
: the stability score (TODO: describe how this is computed)
- for each protease, a file called
An example analysis that reproduces the results from Rocklin et al. from the starting deep-sequencing and FACS data
I provide an example of how to execute this pipeline in the Jupyter
notebook called example_analysis_code.ipynb
. In this notebook, I reproduce the entire analysis from the Rocklin et al. study, starting from the deep-sequencing data. Most of the input data for the analysis is stored in the directory called data/
. However, the input FASTQ files are stored in a separate location on TACC.
After executing the analysis, I test that the results of the pipeline match the original results from the Rocklin et al. study. To do so, I downloaded the following files from the paper's supplemental info and uploaded them in the following location in this repository:
- for each protease, I uploaded a file giving protein counts generated from the raw deep-sequencing data. These files are stored in the directory
data/original_Rocklin_counts/
and are calledrd4_chymo.counts
andrd4_tryp.counts
for chymotrypsin and trypsin, respectively. - for each protease, I uploaded a file giving EC50 values (and other related metadata) for each protein design. These files are stored in the directory
data/original_Rocklin_EC50_values/
and are calledrd4_chymo.sel_k0.8.erf.5e-7.0.0001.3cycles.fulloutput
andrd4_tryp.sel_k0.8.erf.5e-7.0.0001.3cycles.fulloutput
for chymotrypsin and trypsin, respectively.
The notebook template_notebook.ipynb
is a template for carrying out the above pipeline for computing stability scores starting from experimental data from the UW BIOFAB uploaded to TACC. See the instructions in the notebook for further details.
I ran the above notebook on TACC. To do so, you will need a TACC account. If you do not already have one, you can request one by following the instructions here: https://sd2e.org/accounts/request-access/
To launch a notebook on Wrangler, go here: https://sd2e.org/workbench/workspace . Note: if you have to sign in, it might take you to a home page that is different than the one linked above. In that case, just click the above link again and it should take you there.
Next, go to the My Apps
tab at the top right, and select the one called HPC Jupyter Notebook - Wrangler
. Enter your email address in the Parameters
section. Sometimes the Job name
section has too many characters. Just delete the last few at the end. Then click run
, and you should receive an email with a link and a password for launching the notebook.
-
How to get
FlowCytometryTools
conda create -n py27 python=2.7 source activate py27 pip install flowcytometrytools python import FlowCytometryTools print(FlowCytometryTools) exit() -- Take note of the site-packages from the print() In the jupyter notebook: sys.path.append("Whatever your path was that ends with this /py27/lib/python2.7/site-packages")
-
How to get
PEAR
conda install -c bioconda pear readlink -f $(which pear)
All of the code for the pipeline is in the directory called scripts/
. This includes scripts that perform the analyses, as well as scripts with functions that are imported as modules.
This is the main script that performs the entire analysis.
- Inputs:
- all inputs are described above in the section called "How to run the pipeline"
- Dependencies:
- all scripts in the
scripts/
directory, described below, as well asPARE
- all scripts in the
- Outputs:
- all outputs are described above in the section called "Output of the pipeline"
This is a script from Rocklin et al. that is used to fit EC50 values. I call this script for this purpose in compute_ec50_values_from_deep_sequencing_data.py
.
python fit_all_ec50_data.py [--counts_dir COUNTS_DIR] [--experimental_summary_file EXPERIMENTAL_SUMMARY_FILE] [--datasets DATASETS] [--output_dir OUTPUT_DIR]
-
Inputs:
--counts_dir
: a path to the directory with the input counts files giving counts for a single protease across all selection levels (e.g., you might have one counts file for trypsin and one for chymotrypsin). Counts files must be structured such that rows are proteins, and columns (space-delimited) with the following names/info:name
: the name of the given proteincounts{selection_index_0}
: the counts for a given protein at the first selection index in the experiment (e.g., this might be set tocounts0
for the counts in the naive library).counts{selection_index_1}
: the counts for a given protein at the second selection index in the experiment.- ... include a column for every selection index in the experiment (e.g., the Rocklin et al. study had columns for selection indices 0-6).
--experimental_summary_file
: the path to an input file that follows the exact same format as theexperiments.csv
file from the Rocklin et al. study. This file has the exact same information as the--experimental_summary_file
input filecompute_ec50_values_from_deep_sequencing_data.py
, but does not have the columns calledexperiment_id
andfastq_id
, and has additional columns called:input
: the name of the counts file with counts for a given sample, excluding the prefix provided in--counts_dir
(e.g., "trypsin.counts" might have counts for all samples challenged with trypsin)column
: the name of the the column in the counts file that corresponds to the selection level of a given sample (e.g., "counts1" would correspond to the first selection level).
--datasets
: a string of comma-delimited datasets to analyze (no-spaces between items, only commas). These datasets are defined in theinput
column of theexperiments.csv
file, formatted as{dataset}.counts
--output_dir
: a path to an output directory where all the results will be stored. This directory will be made if it does not already exist.
-
Dependencies:
- the modules called
protease_sequencing_model.py
,utility.py
, andcompile_counts_and_FACS_data/__init__.py
, which are described below
- the modules called
-
Outputs:
- all files in the
ec50_values/
results directory described above, except for theexperiments.csv
file
- all files in the
This is a script that computes stability scores from EC50 values using the unfolded-state model from Rocklin et al.
- Inputs:
- need to add
- Dependencies:
- the module called
sequence_protease_susceptibility.py
, which is described below - the parameters file called
unfolded_state_model_params
, which is described below
- the module called
- Outputs:
- all files in the
stability_scores/
results directory described above
- all files in the
This is a custom script with Python
functions for analyzing deep-sequencing data, used in compute_ec50_values_from_deep_sequencing_data.py
.
These are both scripts from Rocklin et al. that are imported in fit_all_ec50_data.py
as modules for computing EC50 values.
This script has functionalities that are used to implement the unfolded-state model in the script compute_stability_scores_from_EC50_values.py
.
A file with values used to parameterize the unfolded-state model in the script compute_stability_scores_from_EC50_values.py
.
- Set up a system for logging progress of the script
- Ask Gabe if he included samples with zero counts in the naive sample.