DISSEQT – DIStribution based modeling of SEQuence Space Time dynamics


DISSEQT.jl

The DISSEQT.jl package is a Julia implementation of the pipeline described in the paper DISSEQT – DIStribution based modeling of SEQuence Space Time dynamics.

Installation

Start Julia (1.1 or later) and enter the Package REPL by pressing ]. Then install DISSEQT.jl by entering:

registry add https://github.com/BioJulia/BioJuliaRegistry.git
add https://github.com/rasmushenningsson/SynapseClient.jl.git
add https://github.com/rasmushenningsson/DISSEQT.jl.git

(The BioJuliaRegistry is needed for the FASTX.jl package used by DISSEQT to read .fasta files.) Also see installation instructions for SynapseClient.jl if you want to enable the Synapse features in DISSEQT.
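For scripted or non-interactive setups, the same installation can probably be done through the Pkg API instead of the Package REPL. The following is a minimal sketch, not taken from the DISSEQT documentation: the helper name `install_disseqt` is made up for illustration, and the call is left commented out because it needs network access.

```julia
using Pkg

# URLs copied from the Package REPL commands above.
const REGISTRY_URL = "https://github.com/BioJulia/BioJuliaRegistry.git"
const PACKAGE_URLS = [
    "https://github.com/rasmushenningsson/SynapseClient.jl.git",
    "https://github.com/rasmushenningsson/DISSEQT.jl.git",
]

# Hypothetical helper (not part of DISSEQT): add the registry first,
# then the two packages in the same order as the REPL commands.
function install_disseqt()
    Pkg.Registry.add(Pkg.RegistrySpec(url = REGISTRY_URL))
    for url in PACKAGE_URLS
        Pkg.add(Pkg.PackageSpec(url = url))
    end
end

# install_disseqt()  # uncomment to run; requires network access
```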

Examples

The complete analysis of deep sequencing data from the DISSEQT paper is available at the collaborative science platform Synapse here. In order to view and download files you must create a Synapse account. (The scripts are also available in the examples folder for reference, but note that you will need a Synapse account to access the data files.)

The Provenance system in Synapse makes it possible to trace the steps used to produce every result in Synapse, showing how the analysis was done (which script was called) and listing all input files. All example scripts below upload their results to Synapse. The scripts themselves are also automatically uploaded to ensure that all analyses can be rerun elsewhere. (To run the scripts locally, you need to make the Julia packages used by each script available, e.g. by running add JLD in the Julia Package REPL.) The steps in the DISSEQT pipeline are outlined below:

Alignment

If you already have BAM files, you can start the DISSEQT pipeline from the next step. Note however that DISSEQT performs iterative alignment. That is, if the consensus sequence of an aligned mutant swarm is different from the reference sequence used during alignment, it is realigned using the new consensus sequence as the reference. The process is repeated until the reference does not change. Iterative alignment improves inference of codon frequencies close to consensus changes.
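The realign-until-stable loop described above can be illustrated with a toy example. This is only a sketch of the control flow, not DISSEQT's implementation: `iterative_alignment` and `toy_consensus` are hypothetical names, and the "alignment" is replaced by a majority vote over reads that are assumed to be already placed in reference coordinates.

```julia
# Toy stand-in for alignment + consensus calling: majority vote per position
# over reads that are assumed to be equal-length and already aligned.
function toy_consensus(reads::Vector{String})
    n = length(first(reads))
    out = Vector{Char}(undef, n)
    for i in 1:n
        counts = Dict{Char,Int}()
        for r in reads
            counts[r[i]] = get(counts, r[i], 0) + 1
        end
        best, bestcount = ' ', 0
        for (c, k) in counts
            # Ties are broken towards the smaller character for determinism.
            if k > bestcount || (k == bestcount && c < best)
                best, bestcount = c, k
            end
        end
        out[i] = best
    end
    return String(out)
end

# Repeat "align against ref, call consensus" until the consensus equals
# the reference that was used for the alignment.
function iterative_alignment(align_and_call, reference::String; maxiter::Int = 10)
    ref = reference
    for iter in 1:maxiter
        cons = align_and_call(ref)
        cons == ref && return (reference = ref, iterations = iter)
        ref = cons
    end
    error("reference did not stabilize within $maxiter iterations")
end

reads = ["ACGA", "ACGA", "ACGT"]
result = iterative_alignment(ref -> toy_consensus(reads), "ACGT")
println(result)
```

In this toy case the first pass yields a consensus that differs from the starting reference, so a second pass is needed to confirm that the reference no longer changes.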

To run alignment locally, you need to have bwa, samtools and fastq-mcf installed and available in your path.
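Before starting a run, it may be convenient to check that the three tools are visible from Julia. The snippet below is a small sketch using `Sys.which` from Julia's standard library; the helper name `missing_tools` is made up for illustration and is not part of DISSEQT.

```julia
const REQUIRED_TOOLS = ["bwa", "samtools", "fastq-mcf"]

# Return the tools that cannot be found on the PATH.
missing_tools(tools) = [t for t in tools if Sys.which(t) === nothing]

absent = missing_tools(REQUIRED_TOOLS)
if isempty(absent)
    println("All alignment tools found.")
else
    println("Missing from PATH: ", join(absent, ", "))
end
```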

Example scripts and other relevant files for running alignment using DISSEQT can be found here. It is recommended to use one script for each run. The Reference Genomes and Adapter files are also needed. The iterative alignment typically took about 1-4 minutes per sample on a modest desktop computer. Alignment for multiple samples can be run in parallel to cut total runtime.
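One possible way to run several samples concurrently from a single Julia session is `asyncmap` from Base, which suits work that mostly waits on external processes. This is a sketch under assumptions: `process_sample` is a hypothetical placeholder for the per-sample alignment, and the sample names are invented.

```julia
# Hypothetical per-sample worker; in practice this would launch the
# alignment for one sample (e.g. by running external tools).
function process_sample(name::String)
    sleep(0.1)            # placeholder for the real work
    return name * ".bam"  # placeholder output name
end

samples = ["sampleA", "sampleB", "sampleC", "sampleD"]

# Run at most two samples concurrently; results keep the input order.
outputs = asyncmap(process_sample, samples; ntasks = 2)
println(outputs)
```

Note that `asyncmap` uses tasks rather than threads, so it helps when the heavy lifting happens in external programs; starting one Julia process per run is an equally valid alternative.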

The outputs of the Alignment step are BAM Files. The consensus sequence and a detailed alignment log for each sample are also saved in the same folder. An overview log file, AlignUtils.log, is also created.

An optional step for quality control is to create Read Coverage graphs by running another script.

[Figure: Read Coverage]

Codon Frequency Inference

Codon frequencies are inferred from BAM Files. An example script can be found here. The output is one Mutant Swarm File per sample and an overview log file. Runtimes for the codon frequency inference were similar to alignment at about 1-4 minutes per sample on a modest desktop computer. Inference for multiple samples can be run in parallel to cut total runtime.
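To make the notion of per-site codon frequencies concrete, here is a deliberately naive sketch that counts codons in toy reads. It is not DISSEQT's inference procedure (which works on BAM Files and is described in the paper); the function name `naive_codon_frequencies` is hypothetical, and the reads are assumed to be in-frame, equal-length and already aligned.

```julia
# For each codon site, count how often each codon occurs among the reads
# and convert the counts to relative frequencies.
function naive_codon_frequencies(reads::Vector{String})
    nsites = length(first(reads)) ÷ 3
    freqs = Vector{Dict{String,Float64}}(undef, nsites)
    for site in 1:nsites
        counts = Dict{String,Int}()
        for r in reads
            codon = r[3site-2:3site]   # nucleotides of this codon site
            counts[codon] = get(counts, codon, 0) + 1
        end
        freqs[site] = Dict(c => k / length(reads) for (c, k) in counts)
    end
    return freqs
end

toy_reads = ["ATGAAA", "ATGAAA", "ATGAAG", "ATGAAA"]
freqs = naive_codon_frequencies(toy_reads)
println(freqs[2])
```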

All the later steps run in a matter of minutes for the whole data set.

Limit of Detection

DISSEQT determines the Limit of Detection for each codon at each site per experiment. To account for differences between runs, a Metadata table with details about the samples is needed for this step. The Limit of Detection script for the Fitness Landscape data set can be found here.

Dimension Estimation

A Talus Plot used for dimension estimation of the Fitness Landscape data set can be created using this script.

[Figure: Talus Plot]

Sequence Space Representation

One of the core features of the DISSEQT pipeline is to produce a low-dimensional representation of the Mutant Swarms in Sequence Space, which is useful for plotting and downstream analysis. The script for the Fitness Landscape data set can be found here. First, a high-dimensional representation is created using the inferred codon frequencies and estimated per-variant limits of detection. Second, SubMatrix Selection SVD is used for dimension reduction, based on the number of dimensions determined in the Talus Plot.

[Figure: SubMatrixSelectionSVD plot]
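As a rough illustration of the dimension reduction step, the sketch below projects a variables × samples matrix onto its first d singular directions using an ordinary truncated SVD. This is a plain SVD swapped in for illustration, not SubMatrix Selection SVD, and the function name `low_dim_representation` and the toy matrix are assumptions.

```julia
using LinearAlgebra

# X has one row per variable and one column per sample.
# Returns a d × nsamples matrix of sample coordinates.
function low_dim_representation(X::AbstractMatrix, d::Integer)
    # Center each variable (row) across samples.
    Xc = X .- sum(X, dims = 2) ./ size(X, 2)
    F = svd(Xc)
    # Sample coordinates along the first d singular directions.
    return Diagonal(F.S[1:d]) * F.Vt[1:d, :]
end

X = [1.0 2 3 4;
     2   1 0 3;
     0   1 5 2;
     3   3 1 0;
     1   0 2 2]

Y = low_dim_representation(X, 2)
println(size(Y))
```

Each column of `Y` is one sample, so the rows can be used directly as plot axes.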

Downstream Analysis

Based on the low-dimensional representation, fitness landscapes and sequence space visualizations are created in this script. An evaluation of the ability of different models to predict fitness is performed in this script.

[Figure: Fitness Landscape]

Contact

If you have problems running DISSEQT, please open an issue in the Issue Tracker or contact rasmus.henningsson@med.lu.se. A docker file will be published soon.
