A pipeline that uses read-level features and extra trees/random forest algorithms for accurate and fast detection of somatic mutations in next generation sequencing data.
If you don't have docker already installed in your system, please follow these instructions.
We have created a docker image with all dependencies installed:
docker pull labxa/rfcaller:1.2.0
The image has the following structure:
databases
directory contains dbSNP, ploidy files and a panel of normals.RFcaller
has the scripts used by the pipeline.output
is the working directory.
/home
|-- databases
|-- dbSNP
|-- UCSC_dbSNP153Common_hg19_combined.vcf.gz
|-- UCSC_dbSNP153Common_hg19_combined.vcf.gz.tbi
|-- UCSC_dbSNP153Common_hg38_combined.vcf.gz
|-- UCSC_dbSNP153Common_hg38_combined.vcf.gz.tbi
|-- ploidy_files
|-- GRCm38.ploidy.file
|-- GRCm39.ploidy.file
|-- PoN
|-- PanelOfNormals_hs37d5.tsv.gz
|-- PanelOfNormals_hs37d5.tsv.gz.tbi
|-- PanelOfNormals_hg38.tsv.gz
|-- PanelOfNormals_hg38.tsv.gz.tbi
|-- RFcaller
|-- scripts
|-- training
/output
In case you want to build your own RFcaller image:
git clone git@github.com:xa-lab/RFcaller.git
wget -O databases.zip https://zenodo.org/record/7113432/files/databases.zip?download=1
unzip databases.zip
cp -r databases/dbSNP RFcaller/databases
rm -rf databases databases.zip
cd RFcaller
docker build -t labxa/rfcaller:1.2.0 .
Here is basic configuration, for more information see docker and RFcaller options.
# Single case
docker run --rm -v /BAMS_PATH/:/bams/ -v /GENOME_PATH/:/genome/ -v $(pwd):/output/ -u $(id -u):$(id -g) -it labxa/rfcaller:1.2.0 -@ INT -nb /bams/NORMAL.BAM -tb /bams/TUMOR.BAM -o OUTPUT --genome /genome/GENOME.FA --dbSNP hg19
# Multiple cases
docker run --rm -v /BAMS_PATH/:/bams/ -v /GENOME_PATH/:/genome/ -v $(pwd):/output/ -u $(id -u):$(id -g) -it labxa/rfcaller:1.2.0 -@ INT -i INPUT.LIST --genome /genome/GENOME.FA --dbSNP hg19
Among all the options offered by docker (docker run --help
), we recommend:
--rm
: Automatically remove the container when it exits.-v, --volume
: Mount local volumes in the container.- With the option
-v $(pwd):/output/
, RFcaller results will be in your current directory.
- With the option
-u, --user
: Specify the user ID and its group ID. It's useful to not run the pipeline as root.-i, --interactive
: Keep STDIN open even if not attached.-t, --tty
: Allocate a pseudo-TTY. When combined with-i
it allows you to connect your terminal with the container terminal.-e, --env
: Set environment variables.
RFcaller has the following required inputs:
-nb, --normalBam
: Path to normal BAM.-tb, --tumorBam
: Path to tumor BAM.-o, --output
: Output file name.
-
-i, --input
: The input is a TSV (tab-separated values) file with five required columns and an optional extra one. This last column is used to specify the file with the positions you want to analyze.Normal name Normal BAM
(CRAM supported)Tumor name Tumor BAM
(CRAM supported)Output name VCF/BED
(optional)
-g, --genome
: Reference genome in FASTA format (bgzip supported).-p, --dbSNP
: VCF file provided by the image with common SNPs (MAF ≥ 1%) to eliminate these positions from the analysis. In case you want to use your own VCF (bgzip supported), use also the--ploidy_file
argument.- Choices:
[hg19, hg38, your_dbSNP]
- Choices:
RFcaller provides additional configuration through the following optional inputs:
-n, --normal
: Name for normal files.- Default: normal
-t, --tumor
: Name for tumor files.- Default: tumor
-@
: Max number of threads to use.- Default: 20
-b, --bamsDir
: Main directory where BAMs are located inside the container. By default, the program will look in the/bams/
directory and that's why we set the docker option-v /BAMS_PATH/:/bams/
. However, if you mount your BAMs in other directory inside the container, you have to use this option to specify it.- Default:
/bams/
- Default:
-r, --regions
: List of regions in which to make the call in VCF, BCF or TSV format.--PoN
: TSV (CHROM, FROM, TO) with positions appearing in a Panel of Normals. UseRFcaller --PoN ?
for more information.- Default: Same version as dbSNP
- Choices:
[hg19, hg38, None, your_PoN]
--ploidy_file
:bcftools call --ploidy_file
to use in variant calling. UseRFcaller --ploidy_file ?
for more information.- Default: Same version as dbSNP
- Choices:
[GRCh37, GRCh38, GRCm38 (mouse), GRCm39 (mouse), None (consider all sites as diploid), your_ploidy_file]
--includePatches
: Tag used to analyze both canonical chromosomes and patches.--keep
: Tag used to keep intermediate files.
RFcaller has some extra parameters to use as filters:
Parameter | Description | Default |
---|---|---|
contamination | Percentage of tumor contamination in normal sample. For example, if you expect that 10% of reads in normal comes from tumor (10% contamination),you have to set "contamination": 0.1 . This argument is used to increase the maximum number of mutated reads allowed in normal sample |
0.05 |
TD_cov_SNV | Minimum coverage for tumor sample (SNVs) | 7 |
TD_cov_INDEL | Minimum coverage for tumor sample (INDELs) | 7 |
ND_cov_SNV | Minimum coverage for normal sample (SNVs) | 7 |
ND_cov_INDEL | Minimum coverage for normal sample (INDELs) | 7 |
TD_mut_SNV | Minimum number of mutated reads for a position in tumor sample to not discard it (SNVs) | 3 |
TD_mut_INDEL | Minimum number of mutated reads for a position in tumor sample to not discard it (INDELs) | 4 |
ND_mut_SNV | Maximum number of mutated reads allowed for a position in normal sample (SNVs) | 3 |
ND_mut_INDEL | Maximum number of mutated reads allowed for a position in normal sample (INDELs) | 2 |
ND_window | Window size around a position to look for mutations in normal (INDELs) | 10 |
SNV_threshold | Minimum QUAL value to consider a SNV as good | 10.1774 |
INDEL_threshold | Minimum QUAL value to consider an INDEL as good | 32.1418 |
polyINDEL_threshold | Minimum QUAL value to consider an homopolymerINDEL (mononucleotide microsatellites) as good | 0.7723 |
Imagine that you have the following configuration:
/home/RFcaller/example/
|-- example.metadata
|-- normal.bam
|-- normal.bam.bai
|-- tumor.bam
|-- tumor.bam.bai
/home/genomes/
|-- hg19.fa
|-- hg19.fa.fai
If we use the following command to run RFcaller:
docker run --rm -v /home/RFcaller/example/:/bams/ -v /home/genomes/:/genome/ -v $(pwd):/output/ -u $(id -u):$(id -g) -it labxa/rfcaller:1.2.0 -i /bams/example.metadata --genome /genome/hg19.fa --dbSNP hg19
The example.metadata file should be:
normal normal.bam tumor tumor.bam example
This is because thanks to the -v
command, the files inside the container are located in the following way:
Computer | Docker |
---|---|
/home/example/example.metadata | /bams/example.metadata |
/home/example/bams/normal.bam | /bams/normal.bam |
/home/example/bams/tumor.bam | /bams/tumor.bam |
/home/example/genome/hg19.fa | /genome/hg19.fa |
A more detailed explanation:
- We use
-i /bams/example.metadata
because inside the container, we are working in the/output/
directory, but we have mounted the metadata file in the/bams/
directory. - We have said before that the default directory for the BAMs inside the container is
/bams/
. In this sense, as our BAMs are in this directory, we don't have to specify the full path in theexample.metadata
file, just write where these BAMs are located inside this folder. - In the case of genome, we have to specify the full path inside the container, that is why we use
--genome /genome/hg19.fa
.
To make it clear how to use the Docker options together with RFcaller, we have prepared a more complex example. For this time, we start with the following directory structure:
/home/complex_example/
|-- example.metadata
/data/
|-- bams
|-- normal_dir
|-- normal.bam
|-- normal.bam.bai
|-- tumor_dir
|-- tumor.bam
|-- tumor.bam.bai
|-- hg19_genome
|-- hg19.fa
|-- hg19.fa.fai
In the event that we were working in the /home/
directory:
docker run --rm -v /data/bams/:/example_bams/ -v /data/hg19_genome/:/genome/ -v $(pwd):/output/ -u $(id -u):$(id -g) -it labxa/rfcaller:1.2.0 -i complex_example/example.metadata --bamsDir /example_bams/ --genome /genome/hg19.fa --dbSNP hg19
The distribution of the files will be:
Computer | Docker |
---|---|
/home/complex_example/example.metadata | /output/complex_example/example.metadata |
/data/bams/normal_dir/normal.bam | /example_bams/normal_dir/normal.bam |
/data/bams/tumor_dir/tumor.bam | /example_bams/tumor_dir/tumor.bam |
/data/hg19_genome/genome/hg19.fa | /genome/hg19.fa |
Thus, the metadata.example
file should be:
normal normal_dir/normal.bam tumor tumor_dir/tumor.bam example
A more detailed explanation:
- As we are using
-v $(pwd):/output/
and our working directory outside the container is/home/
, theexample.metadata
file in the container is inside the foldercomplex_example
. This is why we use-i complex_example/example.metadata
. - We have mounted the BAMs directory in
/example_bams/
instead of/bams/
, so we also have to changed BAMs directory for RFcaller with the-b, --bamsDir
option. Now, for theexample.metadata
we have to write the full BAMs path removing the main directory because it is defined with--bamsDir
.
The time zone of the docker image is Europe/Madrid
and it's used to set the time for the log file. To change this, add the -e
option to set environment variables:
docker run --rm -e "TZ=$(cat /etc/timezone)" -v /BAMS_PATH/:/bams/ -v /GENOME_PATH/:/genome/ -v $(pwd):/output/ -u $(id -u):$(id -g) -it labxa/rfcaller:1.2.0 -@ INT -i INPUT.LIST --genome /genome/GENOME.FA --dbSNP hg19/hg38
Or set it manually (to know your TZ visit: TZ database)
docker run --rm -e "TZ=America/Toronto" -v /BAMS_PATH/:/bams/ -v /GENOME_PATH/:/genome/ -v $(pwd):/output/ -u $(id -u):$(id -g) -it labxa/rfcaller:1.2.0 -@ INT -i INPUT.LIST --genome /genome/GENOME.FA --dbSNP hg19/hg38
For each case three directories and a final file will be created:
- calling: bcftools basic call results
${name}.vcf.gz
-> Original calling from bcftools${name}.norm.vcf.gz
-> After normalizing the indels${name}.filter.norm.vcf.gz
-> After removing SNPs and very low quality mutations${name}.SNVs.filter.norm.vcf
and${name}.INDELs.filter.norm.vcf
-> Separate SNVs and INDELs into two filespolyIndel.pos
-> BED file with those positions which contain an homopolymer INDEL
- somaticSNV: RFcaller results for SNVs
reduced.positions
-> All positions to analyze in${name}.SNVs.filter.norm.vcf
reduced_*.bam
-> A small bam with just the reads overlapping positions inreduced.positions
*.mini.pileup
and${name}.mini.pileup.vcf
-> Filtered mini.pileup and its VCF for the positions inreduced.positions
common.positions
-> File with the positions in${name}.mini.pileup.vcf
with the formatCHR \t POS
${tumor}.read.names
,${name}.mutations.interval
,${name}.sequence
,${name}.stats
,${name}.cigar
and${name}.mapq
-> Intermediate files with read-level features${name}.prepare_muts.csv
-> File generated from gathering the above features and filtering themregression_results_SNVs_${name}.txt
-> Regression algorithm result for SNVsmutations_SNVs_${name}.vcf.gz
-> Final set of SNVs
- somaticINDEL: RFcaller results for INDELs
reduced.positions
-> All positions to analyze in${name}.INDELs.filter.norm.vcf
short_reduced.positions
andlong_reduced.positions
-> Positions separated by the size of the INDEL (short < 7 and long >= 7)reduced_*.bam
-> A small bam with just the reads overlapping positions inreduced.positions
*.mini.pileup
and${name}.mini.pileup.vcf
-> Filtered mini.pileup and its VCF for the positions inreduced.positions
short_common.positions
,long_common.positions
andcommon.positions
-> File with the positions in${name}.mini.pileup.vcf
with the formatCHR \t POS
${name}.features
,${name}.mutations.interval
,${name}.sequence
,${name}.distribution
,${name}.stats
,${name}.cigar
and${name}.mapq
-> Intermediate files with read-level features${name}.prepare_muts.csv
-> File generated from gathering the above features and filtering themregression_results_INDELs_${name}.txt
-> Regression algorithm result for INDELsmutations_INDELs_${name}.vcf.gz
-> Final set of INDELs
mutations_${name}.vcf
-> Final VCF with the mutations from SNV and INDEL pipelines
If the tag --keep
is not set, all above directories are removed and only the final VCF file mutations_${name}.vcf
in conserved.
Two additional global files are created:
- RFcaller.log -> Log file with the commands that have been executed for each case and their results
- .stderr.log -> A file with errors and warnings that have occurred during the pipeline
Citation: If you use RFcaller in your publication, please cite us:
Ander Díaz-Navarro et al., RFcaller: a machine learning approach combined with read-level features to
detect somatic mutations, NAR Genomics and Bioinformatics, Volume 5, Issue 2, June 2023, lqad056,
https://doi.org/10.1093/nargab/lqad056
Authors: Ander Díaz-Navarro, Pablo Bousquets-Muñoz & Xose S. Puente -- Universidad de Oviedo