Skip to content

lisemangiante/needlestack

 
 

Repository files navigation

A multi-sample somatic variant caller

Join the chat at https://gitter.im/iarcbioinfo/needlestack Circle CI Docker Hub

Warning: development in progress, unreliable results warranted.

Please wait upcoming publication before using it in production.

Contact: follm@iarc.fr

Needlestack development is support by the US National Cancer Institute (grant number R21CA175979) and the French Institut national du cancer.

Description

Needlestack is an ultra-sensitive multi-sample variant caller for Next Generation Sequencing (NGS) data. It is based on the idea that analysing several samples together can help estimate the distribution of sequencing errors to accurately identify variants. It has been initially developed for somatic variant calling using very deep NGS data from circulating free DNA, but is also applicable to lower coverage data like Whole Exome Sequencing (WES) or even Whole Genome Sequencing (WGS). It is a highly scalable and reproducible pipeline thanks to the use of nextflow and docker technologies.

Here is a summary of the method:

  • At each position and for each candidate variant, we model sequencing errors using a negative binomial regression with a linear link and a zero intercept. The data is extracted from the BAM files using samtools.
  • Genetic variants are detected as being outliers from the error model. To avoid these outliers biasing the regression we use a robust estimator for the negative binomial regression (published here with code available here).
  • We calculate for each sample a p-value for being a variant (outlier from the regression) that we further transform into q-values to account for multiple testing.

Input

  • A set of BAM files (called *.bam) grouped in a single folder along with their index files (called *.bam.bai). A minimum of 20 BAM files is recommended.
  • A reference fasta file (eventually compressed with bgzip) along with its faidx index (and *.gzi faidx index if compressed).
  • Optionally a bed file, otherwise the variant calling is performed on the whole reference provided.

Quick start

Needlestack works under most Linux distributions and Apple OS X.

  1. Install java JRE if you don't already have it (7 or higher).

  2. Install nextflow.

    curl -fsSL get.nextflow.io | bash

    And move it to a location in your $PATH (/usr/local/bin for example here):

    sudo mv nextflow /usr/local/bin
  3. Install docker.

    This is very system specific (but quite easy in most cases), follow docker documentation. Also follow the optional configuration step called Create a Docker group in their documentation.

  4. Optionally download a sample dataset.

    git clone --depth=1 https://github.com/mfoll/NGS_data_test.git
  5. Run the pipeline.

    Here on the example dataset downloaded above:

    cd NGS_data_test/1000G_CEU_TP53/
    nextflow run iarcbioinfo/needlestack -with-docker  \
             --bed TP53_all.bed --bam_folder BAM/ --fasta_ref 17.fasta.gz

    You will find a VCF file called all_variants.vcf in the BAM/ folder once done.

    The first time it will take more time as the pipeline will be downloaded from github and the docker container from dockerhub.

    Creating an alias for the long command above can be useful. For example:

    alias needlestack='nextflow run iarcbioinfo/needlestack -with-docker'

    If you want to permanantly add this alias (and not just for your current session), add the above line to your ~/.bashrc file (assuming you are using bash).

    It will allow you to do this:

    needlestack --bed TP53_all.bed --bam_folder BAM/ --fasta_ref 17.fasta.gz
  6. Update the pipeline

    You can update the nextflow sofware and the pipeline itself simply using:

    nextflow -self-update
    nextflow pull iarcbioinfo/needlestack

    You can also automatically update the pipeline when you run it by adding the option -latest in the nextflow run command. Doing so you will always run the latest version from Github.

    Official releases can be found here. There is a corresponding official docker container for each release and one can run a particular version using (for example for v0.3):

    nextflow run iarcbioinfo/needlestack -r v0.3 -with-docker \
             --bed TP53_all.bed --bam_folder BAM/ --fasta_ref 17.fasta.gz

Detailed instructions

Nextflow and Docker

If you can't install docker or don't want to use it, the pipeline will also work if you install perl, bedtools, samtools and Rscript from R and put them in your path (executables are assumed to be respectively called perl, bedtools, samtools and Rscript). In this case, remove the -with-docker option from step 5 above.

The exact same pipeline can be run on your computer or on a HPC cluster, by adding a nextflow configuration file to choose an appropriate executor. For example to work on a cluster using SGE scheduler, simply add a file named nextflow.config in the current directory (or ~/.nextflow/config to make global changes) containing:

process.executor = 'sge'

Other popular schedulers such as LSF, SLURM, PBS, TORQUE etc. are also compatible. See the nextflow documentation here for more details. Also have a look at the other parameters for the executors, in particular queueSize that defines the number of tasks the executor will handle in a parallel manner. Parallelism in needlestack is managed by splitting the genomic regions in pieces of equal sizes (--nsplit). Note that dealing with very large regions can take a large amount of memory, therefore splitting more is more memory-efficient.

Parameters

Type --help to get the full list of options. --bam_folder and --fasta_ref are compulsary. The optional parameters with default values are:

Parameter Default value Description
min_dp 30 Minimum median coverage to consider a site. In addition, at least 10 samples have to be covered by min_dp.
min_ao 3 Minimum number of non-ref reads in at least one sample to consider a site
nsplit 1 Split the bed file in nsplit pieces and run in parallel
min_qval 50 qvalue threshold in Phred scale to consider a variant
sb_type SOR Strand bias measure, either SOR, RVSB or FS
sb_snv 100 or 1000 Strand bias threshold for SNVs (100 (1000 if FS) = no filter)
sb_indel 100 or 1000 Strand bias threshold for indels (100 (1000 if FS) = no filter)
map_qual 20 Min mapping quality (passed to samtools)
base_qual 20 Min base quality (passed to samtools)
max_DP 30000 Downsample coverage per sample (passed to samtools)
use_file_name Put this argument to use the bam file names as sample names. By default the sample name is extracted from the bam file SM tag.
all_SNVs Put this argument to output all SNVs, even when no variant is detected. Note that positions with zero coverage for all samples might still be missing depending on how the region split is performed
no_plots Put this argument to remove pdf plots of regressions from the output
no_labels Put this argument for not labeling the outliers on regression plots
no_indels Put this argument to do not perform the variant calling on insertions and deletions
no_contours Put this argument to do not plot qvalues contours (for qvalue threshold={10,30,50,70,100} by default) and do not plot minimum detectable allelic fraction in function of coverage
out_folder --bam_folder Output folder, by default equals to the input bam folder
out_vcf all_variants.vcf File name of final VCF
bed BED file containing a list of regions (or positions) where needlestack should be run
region A region in format CHR:START-END where calling should be done
pairs_file A tab-delimited file containing two columns (normal and tumor sample names) for each sample in line. This enables matched tumor/normal pair calling features (see below)
power_min_af   Allelic fraction used to classify genotypes to 0/0 or ./. depending of the power to detect a variant at this fraction (see below)
extra_robust_gl Add this argument to perform extra-robust regression (useful for common germline SNPs, see below)
sigma_normal 0.1 Sigma parameter for negative binomial modeling of expected germline allelic fraction. We strongly recommend not to change this parameter unless you really know what it means
input_vcf A VCF file (from GATK) where calling should be done. Needlestack will extract DP and AO from this VCF (DP and AD fields) and annotate it with phred q-value score (FORMAT/QVAL field), error rate (INFO/ERR) and over-dispersion sigma parameter (INFO/SIG). WARNING: by default, only work with split (coreutils) version > 8.13

By default, if neither --bed nor --region are provided, needlestack would run on whole reference, building a bed file from fasta index inputed. If --bed and --region are both provided, it should run on the region only.

Simply add the parameters you want in the command line like --min_dp 1000 for example to change the min coverage or --all_SNVs to output all sites.

Germline, somatic, matched Tumor-Normal pairs calling and contamination

When using matched tumor/normal, Needlestack can classify variants (VCF FORMAT/STATUS field) according to the following table: Status Table

For this one need to provide a tab-delimited file containing two columns with normal and tumor sample names using the --pairs_file option. The first line of this file is a header with TUMOR and NORMAL keywords. When one normal or one tumor is missing, one can write NA. In this mode, the parameter power_min_af defines the allelic fraction in the tumor one is trying to detect to classify genotypes as ./. or 0/0 depending on the power to detect this allelic fraction. Variants found as somatic in a tumor, but germline in another sample of the series will be flagged as POSSIBLE_CONTAMINATION. We found this particularly important, as needlestack is very sensitive to low allelic fractions, to filter out contamination among samples for pooled exome capture.

In other cases (when there is no --pairs_file parameter defined), genotypes are defined as ./. or 0/0 assuming one is looking for allelic fractions expected for germline variants (negative binomial distribution centered at 0.5 with over-dispersion parameter sigma=sigma_normal, with sigma_normal=0.1 by default). If you are looking for somatic variants without matched-normal and assuming you are interesting to correctly distinguish ./. and 0/0genotypes, you can set the power_min_af parameter to the lowest allelic fraction of somatic variants you are interested with (and your coverage allows you to find). Note that this is by far not the most common situation, and that in most cases you don't have to worry about the power_min_af parameter.

Notes

Common variants

Needlestack is made to identify rare variants (i.e. only a few samples in your set of samples have a particular variant), because of the robust regression used. Therefore, common SNPs (>10%) or strong somatic hotspots will be missed. The optional extra_robust_gl can overcome partially this issue for common germline mutations: it first discard high allelic fraction (>20%, assuming these are likely true variants) before fitting the regression model when between 10% and 50% of sample have such a high allelic fraction. A flag is written in the VCF INFO/WARN field when this happened (EXTRA_ROBUST_GL). Additionally, when an other allele than the reference allele is the most common, it is taken as the reference allele and a flag is also written in the VCF (INFO/WARN=INV_REF).

Strand bias

For conventional variant callers, GATK WES/WGS recommended values for SOR strand bias are SOR < 4 for SNVs and < 10 for indels. We haven't found this particularly useful, but a more detailed evaluation is necessary. For amplicon based targeted sequencing, RVSB>0.85 seems to reduce some erros. There is no hard filter by default as this is easy to do afterward using bcftools filter command.

Reproducibility

A good practice is to keep (and publish) the .nextflow.log file create during the pipeline process, as it contains useful information for reproducibility (and for debugging in case of problem). You should keep the trace.txt file containing even more information to keep for records. Nextflow also creates a nice processes execution timeline file (a web page) in timeline.html.

Pipeline execution DAG

About

Multi-sample somatic variant caller

Resources

License

Stars

Watchers

Forks

Packages

No packages published

Languages

  • R 67.7%
  • Nextflow 22.5%
  • Perl 5.9%
  • TeX 3.1%
  • Shell 0.8%