cacao is a computational workflow that provides software and data to assess sequencing depth at clinically actionable/pathogenic loci in cancer for a given sequence alignment (BAM/CRAM). Most importantly, the software pinpoints genomic loci of clinical relevance in cancer that have sufficient sequencing coverage for reliable variant calling. In combination with the variants that have actually been identified, it may thus serve to confirm negative findings, a matter of significant clinical value that is underappreciated in current cancer sequencing analysis. The specific requirements for denoting loci as callable (i.e. depth and alignment quality) can be configured by the user, and should reflect how the input is used for variant calling (RNA/DNA, germline/somatic calling).
Technically, cacao combines the speed of mosdepth with the powerful R Markdown framework for interactive data reporting. It currently uses Docker for software encapsulation to ease the installation process (a Conda package is in the making).
- December 9th 2020: 0.3.1 release
- Updated track directory for ClinVar and CIViC
- Dockerfile uses renv for improved installation of R package dependencies
Three clinical genomic tracks in BED format have been created:
- Loci with pathogenic and likely pathogenic variants in protein-coding genes related to cancer predisposition and inherited cancer syndromes (BRCA1, BRCA2, ATM etc.)
- Variants have been retrieved from ClinVar (December 2020 release)
- The collection of cancer predisposition genes comes from the superpanel (panel 0) established in the Cancer Predisposition Sequencing Reporter (CPSR)
- Loci associated with actionable somatic variants (related to prognosis, diagnosis, or drug sensitivity, e.g. BRAF V600E)
- Variants have been retrieved from CIViC (data harvested December 9th 2020)
- Only variants that can be mapped unambiguously to the genome are considered as sources of actionable loci
- Loci identified as somatic mutational hotspots (i.e. likely driver alterations) in cancer
- Variants have been retrieved from cancerhotspots.org (v2)
IMPORTANT: For each variant identified from the three sources above, we use a surrounding sequence window of approximately 10bp. The mean depth across this window is calculated and represents the coverage of the locus.
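To illustrate the windowing idea, here is a minimal sketch of how a mean depth over a roughly 10bp window could be computed from per-base depths. This is not cacao's implementation (cacao relies on mosdepth and the pre-built BED tracks). The function names, the way the window is centred on the variant, and the clipping at position 0 are all assumptions made for illustration.

```python
from statistics import mean


def locus_window(pos, size=10):
    """Return a 0-based, half-open (start, end) window of `size` bp
    around a 1-based variant position.

    The centring used here is an assumption; the README only states
    that the window is approximately 10bp.
    """
    half = size // 2
    start = max(0, pos - 1 - half)  # clip at the start of the sequence
    end = start + size
    return start, end


def mean_window_depth(depths, pos, size=10):
    """Mean depth across the window, given a list of per-base depths
    indexed by 0-based position (hypothetical input format)."""
    start, end = locus_window(pos, size)
    window = depths[start:end]
    return mean(window) if window else 0.0
```

For example, `mean_window_depth(depths, 50)` averages the per-base depths at 0-based positions 44 to 53.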
All three tracks (hereditary, somatic_actionable, and somatic_hotspot) are available for GRCh37 and GRCh38, and there are also tab-separated files that link each locus to its associated
- variants and phenotypes (ClinVar)
- clinical evidence items (therapeutic context, evidence level, from CIViC)
- tumor types (cancerhotspots.org)
- An example report from the CACAO workflow showing callable cancer loci in an RNA sequence alignment.
- Prerequisites:
- Make sure that Docker is installed and running
- The CACAO workflow script (cacao_wflow.py) requires that Python3 is installed
- Download the latest release
- Pull the latest docker image
docker pull sigven/cacao:0.3.1
Run the CACAO workflow with the cacao_wflow.py Python script, which takes the following required and optional arguments:
usage:
cacao_wflow.py -h [options]
--query_aln BAM/CRAM
--track_dir TRACK_DIR
--output_dir OUTPUT_DIR
--genome_assembly grch37|grch38
--sample_id SAMPLE_ID
--mode hereditary|somatic|any
cacao - assessment of sequencing coverage at pathogenic and actionable loci in
cancer
Required arguments:
--query_aln QUERY_ALN
Query alignment file (BAM/CRAM)
--track_dir TRACK_DIR
Directory with BED tracks of pathogenic/actionable cancer loci for grch37/grch38
--output_dir OUTPUT_DIR
Output directory
--genome_assembly {grch37,grch38}
Human genome assembly build: grch37 or grch38
--mode {hereditary,somatic,any}
Choice of loci and clinical cancer context (cancer predisposition/tumor sequencing)
--sample_id SAMPLE_ID
Sample identifier - prefix for output files
Optional arguments:
-h, --help show this help message and exit
--mapq MAPQ mapping quality threshold (default: 0)
--threads THREADS Number of mosdepth BAM decompression threads. (use 4
or fewer) (default: 0)
--callability_levels_germline CALLABILITY_LEVELS_GERMLINE
Simple colon-separated string that defines four levels
of variant callability: NO_COVERAGE (0), LOW_COVERAGE
(1-9), CALLABLE (10-99), HIGH_COVERAGE (>= 100).
Initial value must be 0. (default: 0:10:100)
--callability_levels_somatic CALLABILITY_LEVELS_SOMATIC
Simple colon-separated string that defines four levels
of variant callability: NO_COVERAGE (0), LOW_COVERAGE
(1-29), CALLABLE (30-199), HIGH_COVERAGE (>= 200).
Initial value must be 0. (default: 0:30:200)
--query_target QUERY_TARGET
BED file with genome target regions subject to
sequencing/analysis (default: None)
--force_overwrite By default, the script will fail with an error if any
output file already exists. You can force the
overwrite of existing result files by using this flag
(default: False)
--version show program's version number and exit
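The callability strings described in the help text above (for example 0:10:100 for germline and 0:30:200 for somatic) can be read as three thresholds that split depth into four levels. The following is a minimal sketch of that interpretation, not the code cacao itself runs. The function names are hypothetical, and the handling of fractional mean depths between 0 and 1 (treated here as LOW_COVERAGE) is an assumption.

```python
def parse_levels(spec):
    """Parse a colon-separated string such as '0:10:100' into three
    increasing integer thresholds. The first value must be 0."""
    parts = [int(p) for p in spec.split(":")]
    if len(parts) != 3 or parts[0] != 0 or not parts[0] < parts[1] < parts[2]:
        raise ValueError(f"invalid callability levels: {spec!r}")
    return tuple(parts)


def classify_depth(mean_depth, spec="0:10:100"):
    """Map a mean depth to one of the four callability levels."""
    _, callable_min, high_min = parse_levels(spec)
    if mean_depth <= 0:
        return "NO_COVERAGE"
    if mean_depth < callable_min:
        return "LOW_COVERAGE"  # assumption: includes depths between 0 and 1
    if mean_depth < high_min:
        return "CALLABLE"
    return "HIGH_COVERAGE"
```

With the germline default, a mean depth of 9 falls in LOW_COVERAGE and 10 in CALLABLE. With the somatic default 0:30:200, the same depth of 10 would still be LOW_COVERAGE.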
Coming
sigven AT ifi.uio.no