This pipeline calls variants (snp/indel and sv) by aligning assembled contigs to a reference as well as aligning contigs from two haplotypes to each other. The main pipeline is in competitive_genotyping.wdl
. The per-sample variant calling and counting takes place in call_assembly_variants.wdl
. There is also a process to qc variant calls by aligning reads to the assemblies and counting read support at the variant positions. This is in get_read_support.wdl
(can be enabled to be called from inside call_assembly_variants.wdl but currently commented out due to high computational time).
assembly_list
: a tab-separated file listing input assemblies with the following fields: sample name, haplotype 1 contig path, haplotype 2 contig path, path to file listing fastq files for read support analysisdataset_list
: this gives a list of paths to illumina data that you want to align to the variants to genotype them in another datasetref
: path to the reference (GRCh38 no alts)ref_index
: path to the reference indexref_name
: reference namesegdup_bed
: bed track of segdups for categorizing variantsstr_bed
: bed track of str for categorizing variants
- SNP/indel calls from each haploid assembly are under
call_small_variants1_ref
andcall_small_variants2_ref
.loose.genotyped.vcf.gz
is relative to reference coordinates.loose2.genotyped.vcf.gz
is relative to the respective contig coordinates. SNP/indel calls from haplotype 1 v haplotype 2 contigs are undercall_small_variants_self
.loose.genotyped.vcf.gz
is relative to haplotype 2 contig coordinates.loose2.genotyped.vcf.gz
is relative to haplotype 1 contig coordinates. In all cases, the variant ids inloose.genotyped.vcf.gz
andloose2.genotyped.vcf.gz
correspond to the same alignment event in both vcfs. In some cases they don't line up exactly one-to-one, because the "genotyping" process summarizes places where the same variant is called in multiple contigs, or the same contig corresponds to multiple variants. - Split-read sv calls from each haploid assembly are under
call_sv1_ref
andcall_sv2_ref
.breakpoints.sorted.bedpe
contains the putative breakpoints that have been classified by SV type based on a number of rules. There may be duplicate breakpoints in this file if multiple contigs have alignments that indicate the same breakpoint. - SNP/indel calls that have been combined to make diploid genotypes are under
combine_small_variants_vcf
. The two reference-based haploid callsets are combined into a diploid callset on reference coordinates insmall_variants.combined.vcf.gz
. Reference-based diploid calls that correspond to calls from haplotype 1 vs haplotype 2 are inref_nonunique_small_variants.vcf.gz
. Reference-based diploid calls that do not correspond to self-calls are inref_unique_small_variants.vcf.gz
. Diploid calls from aligning the haplotypes to each other are sorted intoself_novel_small_variants.vcf.gz
andself_known_small_variants.vcf.gz
, based on whether they correspond to reference-based calls or not. - SV calls that have been combined to make diploid genotypes are under
combine_sv
. Haploid calls from haplotype 1 vs ref and haplotype 2 vs ref are compared using pairtopair with 50 bp slop. These calls are output inref1_ref2.bedpe
where the name column contains the SV type, the genotype (homalt=1/1, ref1=1|0, ref2=0|1), and a unique number. - Variants are counted and summarized in
count_variants
andcount_self_variants
.
I included Rmd files that I used to produce plots: assembly_variants.Rmd
takes counts.txt
, which is a concatenation of all of the counts.txt
files from the count_variants
step in call_assembly_variants.wdl
. It also takes self_counts.txt
, which is a concatenation of all the counts.txt
files from the count_self_variants
step in call_assembly_variants.wdl
. read_support.Rmd
takes contigs2_contigs1_support.txt
, which is the support.txt
output of the combine_small_variant_support_contigs1_contigs2
step of get_read_support.wdl
. ref1_self.txt
and ref2_self.txt
are outputs of the correspond_variant_ids
step of get_read_support.wdl
. contigs2_contigs1.{snps, ins, del}.txt
are simply lists of the variant ids sorted by type, this could be generated in a number of ways. contigs2_contigs1.genotypes.txt
gives the "genotypes" from the original variant calling step. These are not actually genotypes, but rather are dependent on how many contigs were observed with that variant alignment. giab_ref1_match.tsv
and giab_ref2_match.tsv
are derived from hap.py analysis comparing the reference-based calls to GiaB calls. This is only applicable to GiaB sample(s) or possibly other samples with truth sets. hap.py
generates a vcf and the tsv file can be derived with the command zcat ref1.happy.vcf.gz | grep -v "^#" | grep -v NOCALL | cut -f 1,2 > giab_ref1_match.tsv
I have been running this using cromwell on compute1. Configuration details are in https://github.com/hall-lab/cromwell-on-lsf I start the cromwell server and the workflow using the following command: LSF_DOCKER_VOLUMES="/home/aregier:/home/aregier /storage1/fs1/ccdg/Active:/storage1/fs1/ccdg/Active /scratch1/fs1/ccdg:/scratch1/fs1/ccdg" bsub -R "rusage[mem=32000]" -q ccdg -G compute-ccdg -oo /storage1/fs1/ccdg/Active/analysis/ref_grant/assembly_analysis_20200220/ca_round1/logs/%J.log -a 'docker(registry.gsc.wustl.edu/apipe-builder/genome_perl_environment:22)' /usr/bin/java -Xmx16g -Dconfig.file=/storage1/fs1/ccdg/Active/analysis/ref_grant/assembly_analysis_20200220/cromwell-on-lsf/cromwell.config -Dsystem.input-read-limits.lines=50000 -jar /opt/cromwell.jar run -t wdl -i /storage1/fs1/ccdg/Active/analysis/ref_grant/assembly_analysis_20200220/multiple_competitive_alignment/competitive_genotyping.inputs.json /storage1/fs1/ccdg/Active/analysis/ref_grant/assembly_analysis_20200220/multiple_competitive_alignment/competitive_genotyping.wdl