The pipeline was designed with three types of analysis in mind:
- Targeted cleavage site prediction in candidate genes, which is the main focus of this pipeline
- Targeted de novo assembly of candidate genes
- Targeted read alignment for expression quantification of candidate genes, i.e. genes of interest
What the three have in common is that they are all targeted analyses of candidate genes rather than of all genes available in a given RNA-Seq dataset. In addition, targeted cleavage site prediction (Task 1) depends on the results of the targeted de novo assembly (Task 2).
Currently, all three analyses are conducted by default when running the pipeline. There is no option to disable any of them yet.
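The dependency between the analyses can be made explicit with a small ordering sketch. It is not part of the pipeline: the task names are illustrative, and treating expression quantification as independent of the other two is an assumption, since only the dependency of cleavage site prediction on de novo assembly is stated here.

```python
from graphlib import TopologicalSorter  # available in Python 3.9+

# Map each task to the set of tasks it depends on.
# Only the first dependency is stated in this document; the empty sets
# (no dependencies) are an assumption made for illustration.
TASK_DEPENDENCIES = {
    "cleavage_site_prediction": {"de_novo_assembly"},
    "de_novo_assembly": set(),
    "expression_quantification": set(),
}


def execution_order(dependencies):
    """Return one valid order in which the tasks can run."""
    return list(TopologicalSorter(dependencies).static_order())


order = execution_order(TASK_DEPENDENCIES)
print(order)
```

Whatever order is returned, the assembly step always precedes cleavage site prediction.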
The pipeline is designed mainly for cloud computing environments, the Google Cloud Platform (GCP) in particular, and to be used in the form of a Docker image. In principle, however, nothing prevents it from being used without Docker in a non-cloud environment.
The easiest way to install is to build a Docker image with the included Dockerfile and use that image directly. To build a Docker image, make sure you have Docker installed, then try:
git clone git@github.com:bcgsc/tasrkleat.git
cd tasrkleat
make build
To see if the image has been built successfully:
docker images
Pre-built Docker images are available on Docker Hub. The tags should match those in the GitHub repo, except for the v0 tag, which is used exclusively for testing, and the latest tag, which reflects the image built automatically from the master branch.
It is recommended to run the pipeline interactively first to get familiar with its behavior before scaling up the computation.
Start an interactive Docker session:
# You may or may not need sudo depending on your user group setup
sudo docker run -it --rm \
-v /path/to/reference:/mnt \
-v /path/to/reads-data/:/data \
zyxue/tasrkleat:latest \
/bin/bash
-it starts an interactive pseudo-TTY session. For details of docker run, please see its documentation.
--rm removes the container after it finishes (e.g. when you exit the container). This is optional but handy; otherwise, you will need to clean up all finished containers manually with docker rm.
-v mounts a path on the local file system to a path inside the container so that the data is accessible to the pipeline. The above command mounts two paths, one for the reference data and one for the reads data.
/bin/bash runs bash when the container first starts so that you can interact with it.
The reference directory (/path/to/reference above) should contain all the necessary reference files. A copy of those used in the manuscript can be found at http://bcgsc.ca/downloads/tasrkleat-static/on-cloud/.
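Before mounting the reference directory, it can help to confirm that it holds everything the pipeline will be pointed at. The following is a minimal sketch, not part of tasrkleat: the helper name is hypothetical, and the expected entries are simply the file names used in the example commands in this README, so adjust them if your copy is laid out differently.

```python
from pathlib import Path

# Names taken from the example commands in this README (an assumption about
# how your reference directory is organized, not a requirement of tasrkleat).
EXPECTED_REFERENCE_ENTRIES = (
    "targets.bf",
    "hg19.fa",
    "gmapdb",
    "ensembl.fixed.sorted.gz",
)


def missing_reference_entries(reference_dir, expected=EXPECTED_REFERENCE_ENTRIES):
    """Return the expected entries that do not exist under reference_dir."""
    root = Path(reference_dir)
    return [name for name in expected if not (root / name).exists()]
```

For example, `missing_reference_entries("/path/to/reference")` returns an empty list when every expected file or directory is present.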
Once inside a tasrkleat container, you are logged in as root, and the environment looks like this:
root@b7aed8a3b50f:/# whoami
root
While the tasrkleat Docker image is just a binary file with all necessary software packaged in, a Docker container is a running instance of the image. By analogy with programming, the image is like a class, and the container is like an instance of that class.
An example command to run the pipeline inside the container:
app.py \
--input-tar /data/data.tar \
--input-bf /mnt/targets.bf \
--transabyss-kmer-sizes 32 52 72 \
--reference-genome /mnt/hg19.fa \
--reference-genome-gmap-index /mnt/gmapdb \
--gtf /mnt/ensembl.fixed.sorted.gz
--input-bf accepts the pre-built input Bloom filters.
--transabyss-kmer-sizes accepts three k-mer sizes.
--input-tar can be a tarball of gzipped FASTQ files or a gzipped tarball of uncompressed FASTQ files; both situations occur in the TCGA samples. This is handled in the extract_tarball function, if you need more details. Currently, tasrkleat can only handle paired-end data.
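To see which of the two tarball layouts a given input uses, the archive can be inspected before running the pipeline. This is a minimal sketch and not the pipeline's extract_tarball function: the helper name, the accepted FASTQ suffixes, and the return shape are all assumptions made for illustration.

```python
import tarfile

# Suffixes treated as FASTQ files (an assumption; extend as needed).
FASTQ_SUFFIXES = (".fastq", ".fq")
GZIPPED_FASTQ_SUFFIXES = tuple(suffix + ".gz" for suffix in FASTQ_SUFFIXES)


def describe_input_tar(tar_path):
    """Report how a reads tarball is compressed.

    Returns (outer_is_gzipped, gzipped_members, plain_members), where the two
    lists hold the names of FASTQ members that are and are not gzipped.
    """
    # A gzip stream always starts with the magic bytes 1f 8b.
    with open(tar_path, "rb") as handle:
        outer_is_gzipped = handle.read(2) == b"\x1f\x8b"

    gzipped, plain = [], []
    # Mode "r:*" lets tarfile detect the outer compression transparently.
    with tarfile.open(tar_path, mode="r:*") as archive:
        for member in archive.getmembers():
            if not member.isfile():
                continue
            name = member.name.lower()
            if name.endswith(GZIPPED_FASTQ_SUFFIXES):
                gzipped.append(member.name)
            elif name.endswith(FASTQ_SUFFIXES):
                plain.append(member.name)
    return outer_is_gzipped, gzipped, plain
```

A plain tar of gzipped FASTQ files yields `(False, [...], [])`, while a gzipped tar of uncompressed FASTQ files yields `(True, [], [...])`. Since only paired-end data is supported, the member list is also a quick way to check that both mates are present.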
After you get familiar with how the pipeline works, you could run it in batch mode, e.g.
sudo docker run --rm \
-v /path/to/reference:/mnt \
-v /path/to/reads-data/:/data \
zyxue/tasrkleat:latest \
app.py --input-tar /data/data.tar \
--input-bf /mnt/targets.bf \
--transabyss-kmer-sizes 32 52 72 \
--reference-genome /mnt/hg19.fa \
--reference-genome-gmap-index /mnt/gmapdb \
--gtf /mnt/ensembl.fixed.sorted.gz
The command is mostly the same as that in interactive mode, except that the parts that enable interaction (e.g. -it and /bin/bash) are removed. It now runs app.py directly instead of /bin/bash when the container starts.
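When many samples need processing, the batch command can be generated rather than typed by hand. The sketch below only builds and prints the commands. The function name and the sample tarball names are hypothetical, the flags are copied from the batch example in this README, and sudo is omitted on the assumption that your user can run docker directly.

```python
import shlex


def build_batch_command(reference_dir, data_dir, tar_name,
                        image="zyxue/tasrkleat:latest",
                        kmer_sizes=(32, 52, 72)):
    """Build the argument list for one non-interactive pipeline run."""
    return [
        "docker", "run", "--rm",
        "-v", f"{reference_dir}:/mnt",
        "-v", f"{data_dir}:/data",
        image,
        "app.py",
        "--input-tar", f"/data/{tar_name}",
        "--input-bf", "/mnt/targets.bf",
        "--transabyss-kmer-sizes", *map(str, kmer_sizes),
        "--reference-genome", "/mnt/hg19.fa",
        "--reference-genome-gmap-index", "/mnt/gmapdb",
        "--gtf", "/mnt/ensembl.fixed.sorted.gz",
    ]


# Print one command per tarball (hypothetical names) for review before running.
for tar_name in ["sample_a.tar", "sample_b.tar"]:
    command = build_batch_command("/path/to/reference", "/path/to/reads-data", tar_name)
    print(shlex.join(command))
```

Each returned list can be passed as-is to `subprocess.run(command, check=True)` once the printed commands look right.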
- Version every package installed in the Dockerfile if possible.
- Push each versioned image explicitly with docker push zyxue/tasrkleat:<tag>.
- Write a changelog with git tag -a <commit hash> for new releases in the following format:

    One sentence summary

    - changed thing a
    - changed thing b

    Memo or other stuff to be recorded
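To keep release messages consistent, the changelog text can be assembled by a small helper. This is a sketch only: the function name is hypothetical, and separating the summary, the change list, and the memo with blank lines is an assumption about the intended layout.

```python
def format_tag_message(summary, changes, memo=""):
    """Assemble a tag message: summary line, bullet list of changes, optional memo."""
    lines = [summary.strip(), ""]
    lines.extend(f"- {change.strip()}" for change in changes)
    if memo.strip():
        lines.extend(["", memo.strip()])
    return "\n".join(lines)


# Example with placeholder text.
print(format_tag_message(
    "One sentence summary",
    ["changed thing a", "changed thing b"],
    "Memo or other stuff to be recorded",
))
```

The resulting text can be supplied to git tag through its -m or -F option.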