SMAP is a pipeline for the process of xenografts sequencing data.
It takes FASTQ as input and outputs specie-specific BAM or gene counts.
These are the few steps to run a complete SMAP pipeline, from genome file download to obtaining specie-specific gene counts and fusion transcripts.
- SMAP requires a simple installation step
- Download and combine reference genomes and transcriptomes. A shell script is given to simplify the genome combination process
- Alignment to combined reference genome/transcriptome. This should follow the usual process of read alignement
- Gene expression quantification. The SMAP R package provides all necessary steps to quantify the expression of genes (or exons) of the host and the graft sequences seperatly
- Fusion detection. The SMAP R package provides a function to efficiently filter false positive gene fusions
- BAM file seperation, only required for other types of analysis such as variant calling.
The SMAP R package can be installed by downloading/cloning the repository and using R CMD INSTALL
or directly from R using the devtools package:
# If not already installed, install the devtools package
install.packages("devtools")
library(devtools)
install_github("cit-bioinfo/SMAP")
The SMAP pipeline requires two additional scripts. These are included in the downloaded/cloned repository. They can also be downloaded seperately:
- A shell script to prepare the genomes and transcriptomes of reference: SMAP_prepareReference.sh
- A python script to seperate the BAM file of aligned reads based on their predicted specie of origin: smap_splitBySpecie_standard.py
To download these scripts seperately:
wget https://raw.githubusercontent.com/cit-bioinfo/SMAP/master/SMAP_prepareReference.sh
wget https://raw.githubusercontent.com/cit-bioinfo/SMAP/master/smap_splitBySpecie_standard.py
Requirements The BAM file seperation smap_splitBySpecie_standard.py
script (which is not required for gene quantification) requires the pysam
python module.
These steps, shown for human and mouse xenografts, detail how to prepare reference genome and transcriptome for SMAP.
It may take several minutes to download the entire genome of both species.
Human :
# Genome Fasta :
wget -O - ftp://ftp.ensembl.org/pub/release-75/fasta/homo_sapiens/dna/Homo_sapiens.GRCh37.75.dna_sm.primary_assembly.fa.gz | gunzip -c > Homo_sapiens.GRCh37.75.dna_sm.primary_assembly.fa
# Transcriptome GTF :
wget -O - ftp://ftp.ensembl.org/pub/release-75/gtf/homo_sapiens/Homo_sapiens.GRCh37.75.gtf.gz | gunzip -c >Homo_sapiens.GRCh37.75.gtf
Mouse :
# Genome Fasta :
wget -O - ftp://ftp.ensembl.org/pub/release-75/fasta/mus_musculus/dna/Mus_musculus.GRCm38.75.dna_sm.primary_assembly.fa.gz | gunzip -c >Mus_musculus.GRCm38.75.dna_sm.primary_assembly.fa
# Transcriptome GTF :
wget -O - ftp://ftp.ensembl.org/pub/release-75/gtf/mus_musculus/Mus_musculus.GRCm38.75.gtf.gz | gunzip -c > Mus_musculus.GRCm38.75.gtf
Use the provided SMAP_prepareReference.sh
shell script to prepare the reference genome and transcriptomes.
sh SMAP_prepareReference.sh \
-t Homo_sapiens.GRCh37.75.dna_sm.primary_assembly.fa \
-u Homo_sapiens.GRCh37.75.gtf \
-m Mus_musculus.GRCm38.75.dna_sm.primary_assembly.fa \
-s Mus_musculus.GRCm38.75.gtf \
-o combined
The combined genome and transcriptome will be available under the combined directory (can be modified using the -o
argument), as announced by the output of SMAP_prepareReference.sh
.
The SMAP pipeline is aligner-agnostic. However, STAR is recommended. This section details how to run STAR. Any personnal or institutional STAR pipeline can be used here with the SMAP combined genome/transcriptome.
STAR first requires the combined genome/transcriptome to be indexed. This only needs to be done once for a given pair of genomes.
mkdir combined/StarIndexed_combined
STAR --runMode genomeGenerate --genomeDir combined/StarIndexed_combined --genomeFastaFiles combined/combined.fasta --sjdbGTFfile combined/combined.gtf --sjdbOverhang 100 # set to the length of base pair sequencing
Each of the samples can then be mapped using the following STAR command in which combined/StarIndexed_combined
is the STAR indexed SMAP combined genome/transcriptome generated in the precedent command, FASTQ1.fq
and FASTQ2.fq
are the paired raw FASTQ files of the sample and combined/combined.gtf
is the newly reference transcriptome (gtf) created by the SMAP_prepareReference.sh
script.
STAR --genomeDir combined/StarIndexed_combined --readFilesIn FASTQ1.fq FASTQ2.fq --sjdbGTFfile combined/combined.gtf --outFilterType BySJout --outFilterMultimapNmax 100 --outSAMtype BAM SortedByCoordinate
# Other parameters for STAR may include (not specific to SMAP) :
# --runThreadN 16
# --alignSJoverhangMin 8 --alignSJDBoverhangMin 1
# --outFilterMismatchNmax 999 --outFilterMismatchNoverLmax 0.04 --alignIntronMin 20
# --alignIntronMax 1000000 --alignMatesGapMax 0 --outSAMtype BAM SortedByCoordinate
# --chimSegmentMin 5 --chimJunctionOverhangMin 5 --chimScoreMin 0
The output is a classical BAM file, that is usually be named Aligned.sortedByCoord.out.bam
.
Use the SMAPcount
function in the SMAP R package. This function gives two separate count data matrix for the tumor and host respectively from bam files using featurecount function at the gene or exon level (gene level by default).
#Example
d = system.file("extdata","Example_STAR_OUTPUT", package = "SMAP")
counts = SMAPcount(SMAPBAM=d, GTF = paste(d, "combined/combined.gtf", sep = "/"))
countsTumor = counts$FCcountsTumor
countsHost = counts$FCcountsHost
countsTumor
corresponds to the count data matrix for the tumor and countsHost
to the one for the host.
First, run the STAR-Fusion tool after dowloading the following fasta files:
wget -O - | gunzip -c ftp://ftp.ensembl.org/pub/release-75/fasta/homo_sapiens/cdna/Homo_sapiens.GRCh37.75.cdna.all.fa.gz >Homo_sapiens.GRCh37.75.cdna.all.fa
wget -O - | gunzip -c ftp://ftp.ensembl.org/pub/release-75/fasta/mus_musculus/cdna/Mus_musculus.GRCm38.75.cdna.all.fa.gz >Mus_musculus.GRCm38.75.dna_sm.primary_assembly.fa
cat Homo_sapiens.GRCh37.75.cdna.all.fa Mus_musculus.GRCm38.75.dna_sm.primary_assembly.fa >combined/combinedCdna.fa
STAR-Fusion command line:
STAR-Fusion -J Chimeric.out.junction -G combined/combined.gtf -C combined/combinedCdna.fa
After running the STAR-Fusion tool, use the SMAPfuz
function in the SMAP R package.
This function aims at defining the best thresholds for chimeric transcripts detection. It is based on an estimation of an H0 distribution of parameters of each of the detected fusion transcripts by taking into account the impossibility of observing cross-species fusions and therefore using them to define the parameters of H0. As of now, only the number of spanning fragments and of junction reads are used.
#Example
d = system.file( "extdata","Example_StarFusion_OUTPUT", package = "SMAP")
fusions = SMAPfuz(STAR_FUSION_DIR=d,"hs")
fusions_filtered = fusions[which(fusions$Combined_adj.pvalue < 0.01),]
fusions_filtered = fusions[which(fusions$Combined_adj.pvalue < 0.001),]
fusions
is a data frame containing one fusion per line. In addition to the information given by STAR fusion, 4 colums are added: JunctionReads_Pvalue: the p-value of the junction read SpanningFrags_Pvalue: the p-value of the spanning fragments Combined_pvalue: The combined p-value using fishers method Combined_adj.pvalue: The FDR correction of the combined p-value
Use the provided smap_splitBySpecie_standard.py
python script to split a bam file into 2 bam files: one for the host and one for the graft.
python smap_splitBySpecie_standard.py bamfile.bam