DeSeq-Free (Whole genome Deep Sequencing analysis of Cell Free tumor DNA) is a Snakemake workflow for analyzing whole-genome sequencing (WGS) data from circulating cell-free DNA (cfDNA) in the plasma of cancer patients in a reproducible, automated, and partially contained manner. It is implemented so that alternative or similar analyses can be added or removed.
We assume that conda is already installed. If it is not, follow the installation instructions at https://docs.conda.io/projects/conda/en/latest/user-guide/install/linux.html
To ease the use of DeSeq-Free, we provide a conda YAML file listing all required tools, including Snakemake.
To use DeSeq-Free:
git clone https://github.com/mdelcorvo/DeSeq-Free.git
cd DeSeq-Free && conda env create -f envs/workflow.yaml
conda activate DeSeq-Free_workflow
# edit the config file and prepare an input file (CSV or Excel)
snakemake --use-conda \
--config \
input=inputfile.xlsx \
output=../output_directory \
genome=genome.fasta
Reference genome
Before starting, you need to download a reference genome from NCBI, Ensembl, or any other authority, for example:
wget https://ftp.ensembl.org/pub/release-100/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.toplevel.fa.gz
Index reference genome for bwa-mem2
Prepare a bwa-mem2 index of the reference genome before mapping. Refer to the bwa-mem2 instructions for details. Example code:
./bwa-mem2 index <in.fasta>
where <in.fasta> is the path to the reference sequence FASTA file.
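The Ensembl file downloaded above is gzip-compressed, so one possible way to chain decompression and indexing is sketched here. The helper name prepare_genome is hypothetical and not part of DeSeq-Free, and the sketch assumes bwa-mem2 is reachable on PATH (it only prints a warning and skips indexing otherwise).

```shell
# Sketch: decompress a gzipped FASTA and index it with bwa-mem2.
# prepare_genome is a hypothetical helper, not shipped with DeSeq-Free.
prepare_genome() {
    local gz="$1"              # e.g. Homo_sapiens.GRCh38.dna.toplevel.fa.gz
    local fasta="${gz%.gz}"    # same path without the .gz suffix

    # Decompress only if no non-empty FASTA exists yet; the archive is kept.
    if [ ! -s "$fasta" ]; then
        gunzip -c "$gz" > "$fasta" || return 1
    fi

    # Build the index when bwa-mem2 is on PATH (an assumption of this sketch).
    if command -v bwa-mem2 >/dev/null 2>&1; then
        bwa-mem2 index "$fasta" >&2 || return 1
    else
        echo "bwa-mem2 not found on PATH; skipping indexing" >&2
    fi

    # Print the path of the uncompressed FASTA for use in the config.
    printf '%s\n' "$fasta"
}
```

Calling `prepare_genome Homo_sapiens.GRCh38.dna.toplevel.fa.gz` prints the path of the uncompressed FASTA, which can then be passed to the workflow. Indexing a full human genome needs substantial memory and disk space, so check the bwa-mem2 documentation for current requirements.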
The pipeline leverages several tools to QC DeSeq-Free libraries, create statistics and an interactive report, and calculate and annotate interaction matrices at different bin sizes: bwa mem, pairtools, juicer, cooler, pairix, Macs2 and FitHiChIP.
You will need to specify the location of the reference genome (hg38) in fasta/fa format with bwa index. Use the parameter genome_data in the config file to add it.
Users are required to provide a metadata file for running the DeSeq-Free workflow:
- metadata file – a tab-delimited text file listing the sample names, the sequencing technology, and the paths to the raw paired FASTQ files
sample | platform | fq1 | fq2 |
---|---|---|---|
Sample1 | ILLUMINA | data/S1_1.fastq.gz | data/S1_2.fastq.gz |
Sample2 | ILLUMINA | data/S2_1.fastq.gz | data/S2_2.fastq.gz |
Sample3 | ILLUMINA | data/S3_1.fastq.gz | data/S3_2.fastq.gz |
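Before launching the workflow it can help to check the metadata file for a wrong header or missing FASTQ files. The following is a minimal sketch, assuming the tab-delimited text form with the four columns shown in the table; check_metadata is a hypothetical helper and not part of DeSeq-Free. Run it from the directory against which the FASTQ paths are meant to be resolved.

```shell
# Sketch: validate a tab-delimited metadata file (sample, platform, fq1, fq2).
# check_metadata is a hypothetical helper, not shipped with DeSeq-Free.
check_metadata() {
    local meta="$1" rc=0 header expected tab
    local sample platform fq1 fq2 fq
    tab="$(printf '\t')"

    [ -r "$meta" ] || { echo "cannot read $meta" >&2; return 2; }

    # The first line must name the four expected columns.
    expected="$(printf 'sample\tplatform\tfq1\tfq2')"
    IFS= read -r header < "$meta"
    if [ "$header" != "$expected" ]; then
        echo "unexpected header: $header" >&2
        rc=1
    fi

    # Every remaining row must point at two existing FASTQ files.
    while IFS="$tab" read -r sample platform fq1 fq2; do
        [ -z "$sample" ] && continue        # skip blank lines
        for fq in "$fq1" "$fq2"; do
            if [ ! -f "$fq" ]; then
                echo "$sample: missing FASTQ '$fq'" >&2
                rc=1
            fi
        done
    done <<EOF
$(tail -n +2 "$meta")
EOF

    return "$rc"
}
```

`check_metadata samples.tsv && echo "metadata OK"` prints the confirmation only when the header matches and every listed file exists; problems are reported on standard error. The file name samples.tsv is only an example.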
- configuration file
The configuration file (config.yaml) contains all the paths to input, output and reference files and additional parameters to customize the pipeline and the performed tests. All of these need to be carefully specified in accordance with the specific experiment.
Important: ALL relative paths will be interpreted relative to the directory where the Snakefile is located. Alternatively, you can use absolute paths.
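To make the path rule concrete, the sketch below shows how a configured path would resolve given the directory that holds the Snakefile. It is only an illustration of the rule stated above, not code taken from the workflow, and resolve_from_snakefile is a hypothetical helper name.

```shell
# Sketch: show where a configured path ends up, given the Snakefile directory.
# resolve_from_snakefile is a hypothetical helper, not shipped with DeSeq-Free.
resolve_from_snakefile() {
    local snakedir="$1" path="$2"
    case "$path" in
        # Absolute paths are used unchanged.
        /*) printf '%s\n' "$path" ;;
        # Relative paths are anchored at the Snakefile directory.
        *)  printf '%s/%s\n' "$(cd "$snakedir" && pwd)" "$path" ;;
    esac
}
```

For example, `resolve_from_snakefile . ../output_directory`, run inside the cloned DeSeq-Free directory, prints that directory followed by `/../output_directory`. The helper does not normalize `..` components; it only shows which directory a relative path is anchored to.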
- reference in a fasta file format, e.g. hg38 with bwa index
- Somatic variant analysis
- Variant allele frequency
- Annotation of somatic variants
- Somatic signatures
- Analysis of somatic CNAs
- Fragment size analysis