GitHub - hall-lab/competitive-alignment

Competitive genotyping pipeline

This pipeline calls variants (SNPs/indels and SVs) by aligning assembled contigs to a reference and by aligning the contigs of the two haplotypes to each other. The main pipeline is in competitive_genotyping.wdl. Per-sample variant calling and counting take place in call_assembly_variants.wdl. A separate workflow, get_read_support.wdl, QCs the variant calls by aligning reads to the assemblies and counting read support at the variant positions. It can be called from inside call_assembly_variants.wdl, but that call is currently commented out because of its high computational cost.

Inputs

assembly_list: a tab-separated file listing the input assemblies, with the following fields: sample name, haplotype 1 contig path, haplotype 2 contig path, and path to a file listing the FASTQ files for read support analysis
dataset_list: a list of paths to Illumina data to align to the variants, in order to genotype the variants in another dataset
ref: path to the reference (GRCh38 no alts)
ref_index: path to the reference index
ref_name: reference name
segdup_bed: BED track of segmental duplications (segdups), used to categorize variants
str_bed: BED track of short tandem repeats (STRs), used to categorize variants
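
As a rough illustration of the assembly_list layout described above, the following sketch parses such a file into records. It assumes there is no header line and that every row has exactly the four listed fields; the field names (`sample`, `hap1_contigs`, `hap2_contigs`, `fastq_list`) and the example paths are made up for this sketch and are not taken from the pipeline.

```python
from typing import Iterable, List, NamedTuple


class AssemblyEntry(NamedTuple):
    """One row of assembly_list (field names are illustrative only)."""
    sample: str
    hap1_contigs: str
    hap2_contigs: str
    fastq_list: str


def parse_assembly_list(lines: Iterable[str]) -> List[AssemblyEntry]:
    """Parse tab-separated assembly_list lines, skipping blank lines."""
    entries = []
    for line_number, line in enumerate(lines, start=1):
        line = line.rstrip("\r\n")
        if not line.strip():
            continue  # ignore blank lines
        fields = line.split("\t")
        if len(fields) != 4:
            raise ValueError(
                f"line {line_number}: expected 4 tab-separated fields, "
                f"got {len(fields)}"
            )
        entries.append(AssemblyEntry(*fields))
    return entries


# Hypothetical example row; real paths will differ.
example = ["sampleA\t/data/sampleA.hap1.fa\t/data/sampleA.hap2.fa\t/data/sampleA.fastqs.txt\n"]
for entry in parse_assembly_list(example):
    print(entry.sample, entry.hap1_contigs, entry.hap2_contigs, entry.fastq_list)
```

Checking the field count up front can catch a malformed row before the workflow is submitted.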

Outputs

SNP/indel calls from each haploid assembly are under call_small_variants1_ref and call_small_variants2_ref. There, loose.genotyped.vcf.gz is in reference coordinates and loose2.genotyped.vcf.gz is in the coordinates of the respective contigs. SNP/indel calls from aligning haplotype 1 contigs against haplotype 2 contigs are under call_small_variants_self. There, loose.genotyped.vcf.gz is in haplotype 2 contig coordinates and loose2.genotyped.vcf.gz is in haplotype 1 contig coordinates. In all cases, a given variant ID in loose.genotyped.vcf.gz and loose2.genotyped.vcf.gz refers to the same alignment event. The records do not always line up exactly one-to-one, because the "genotyping" process summarizes cases where the same variant is called in multiple contigs or the same contig corresponds to multiple variants.
Split-read SV calls from each haploid assembly are under call_sv1_ref and call_sv2_ref. breakpoints.sorted.bedpe contains the putative breakpoints, classified by SV type according to a set of rules. The file may contain duplicate breakpoints if alignments from multiple contigs indicate the same breakpoint.
SNP/indel calls that have been combined to make diploid genotypes are under combine_small_variants_vcf. The two reference-based haploid callsets are combined into a diploid callset on reference coordinates in small_variants.combined.vcf.gz. Reference-based diploid calls that correspond to calls from haplotype 1 vs haplotype 2 are in ref_nonunique_small_variants.vcf.gz. Reference-based diploid calls that do not correspond to self-calls are in ref_unique_small_variants.vcf.gz. Diploid calls from aligning the haplotypes to each other are sorted into self_novel_small_variants.vcf.gz and self_known_small_variants.vcf.gz, based on whether they correspond to reference-based calls or not.
SV calls that have been combined to make diploid genotypes are under combine_sv. Haploid calls from haplotype 1 vs ref and haplotype 2 vs ref are compared using pairtopair with 50 bp slop. The combined calls are written to ref1_ref2.bedpe, where the name column contains the SV type, the genotype (homalt=1/1, ref1=1|0, ref2=0|1), and a unique number.
Variants are counted and summarized in count_variants and count_self_variants.
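
To make the SV combination step above more concrete, here is a simplified sketch of the idea of matching two BEDPE breakpoints with 50 bp slop and assigning the genotype labels listed for ref1_ref2.bedpe. It is not the pipeline's code, which uses pairtopair. The `Breakpoint` type and both functions are hypothetical, SV type and strand are ignored, and the reading of ref1 and ref2 as "seen only in haplotype 1" and "seen only in haplotype 2" is an assumption.

```python
from typing import NamedTuple, Tuple


class Breakpoint(NamedTuple):
    """The six coordinate columns of a BEDPE record (0-based, half-open)."""
    chrom1: str
    start1: int
    end1: int
    chrom2: str
    start2: int
    end2: int


def _overlaps(chrom_a, start_a, end_a, chrom_b, start_b, end_b, slop):
    # Pad interval A by `slop` on each side, then test for half-open overlap.
    return (chrom_a == chrom_b
            and max(0, start_a - slop) < end_b
            and start_b < end_a + slop)


def breakpoints_match(a: Breakpoint, b: Breakpoint, slop: int = 50) -> bool:
    """True if both ends of `a`, padded by `slop`, overlap the ends of `b`."""
    return (_overlaps(a.chrom1, a.start1, a.end1,
                      b.chrom1, b.start1, b.end1, slop)
            and _overlaps(a.chrom2, a.start2, a.end2,
                          b.chrom2, b.start2, b.end2, slop))


def diploid_genotype(in_hap1: bool, in_hap2: bool) -> Tuple[str, str]:
    """Map haplotype presence to a (label, genotype) pair (assumed meaning)."""
    if in_hap1 and in_hap2:
        return ("homalt", "1/1")
    if in_hap1:
        return ("ref1", "1|0")
    if in_hap2:
        return ("ref2", "0|1")
    raise ValueError("a call must be present in at least one haplotype")


# Made-up coordinates: the two calls differ by 40 bp and 20 bp at their ends.
hap1_call = Breakpoint("chr1", 1000, 1001, "chr1", 5000, 5001)
hap2_call = Breakpoint("chr1", 1040, 1041, "chr1", 4980, 4981)
matched = breakpoints_match(hap1_call, hap2_call)
print(diploid_genotype(in_hap1=True, in_hap2=matched))
```

Requiring both ends to overlap keeps two SVs that share only one breakpoint from being merged into a single homozygous call.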

Plotting

I included the Rmd files that I used to produce plots.

assembly_variants.Rmd takes the following files:
counts.txt: a concatenation of all of the counts.txt files from the count_variants step in call_assembly_variants.wdl
self_counts.txt: a concatenation of all of the counts.txt files from the count_self_variants step in call_assembly_variants.wdl

read_support.Rmd takes the following files:
contigs2_contigs1_support.txt: the support.txt output of the combine_small_variant_support_contigs1_contigs2 step of get_read_support.wdl
ref1_self.txt and ref2_self.txt: outputs of the correspond_variant_ids step of get_read_support.wdl
contigs2_contigs1.{snps, ins, del}.txt: simply lists of the variant IDs sorted by type; these could be generated in a number of ways
contigs2_contigs1.genotypes.txt: the "genotypes" from the original variant calling step. These are not actually genotypes; they depend on how many contigs were observed with that variant alignment.
giab_ref1_match.tsv and giab_ref2_match.tsv: derived from a hap.py analysis comparing the reference-based calls to GiaB calls. This is only applicable to GiaB sample(s), or possibly other samples with truth sets. hap.py generates a VCF, and the tsv file can be derived from it with the command:

zcat ref1.happy.vcf.gz | grep -v "^#" | grep -v NOCALL | cut -f 1,2 > giab_ref1_match.tsv
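
For readers who prefer to do the same filtering in a script, the sketch below mirrors the zcat command above step by step: it skips header lines, drops any line containing NOCALL, and writes the first two tab-separated columns. The function name `write_match_positions` is made up for this sketch, and the file names in the usage comment are the ones from the command.

```python
import gzip


def write_match_positions(vcf_gz_path: str, out_path: str) -> int:
    """Write the first two columns of non-header, non-NOCALL lines.

    Returns the number of lines written.
    """
    written = 0
    with gzip.open(vcf_gz_path, "rt") as vcf, open(out_path, "w") as out:
        for line in vcf:
            if line.startswith("#"):
                continue  # same as: grep -v "^#"
            if "NOCALL" in line:
                continue  # same as: grep -v NOCALL (matches anywhere in the line)
            fields = line.rstrip("\n").split("\t")
            out.write("\t".join(fields[:2]) + "\n")  # same as: cut -f 1,2
            written += 1
    return written


# Usage, with the file names from the command above:
# write_match_positions("ref1.happy.vcf.gz", "giab_ref1_match.tsv")
```

Like the grep in the original command, this drops a line if NOCALL appears anywhere in it, not only in a specific column.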

Notes on running

I have been running this using Cromwell on compute1. Configuration details are in https://github.com/hall-lab/cromwell-on-lsf. I start the Cromwell server and the workflow using the following command:

LSF_DOCKER_VOLUMES="/home/aregier:/home/aregier /storage1/fs1/ccdg/Active:/storage1/fs1/ccdg/Active /scratch1/fs1/ccdg:/scratch1/fs1/ccdg" bsub -R "rusage[mem=32000]" -q ccdg -G compute-ccdg -oo /storage1/fs1/ccdg/Active/analysis/ref_grant/assembly_analysis_20200220/ca_round1/logs/%J.log -a 'docker(registry.gsc.wustl.edu/apipe-builder/genome_perl_environment:22)' /usr/bin/java -Xmx16g -Dconfig.file=/storage1/fs1/ccdg/Active/analysis/ref_grant/assembly_analysis_20200220/cromwell-on-lsf/cromwell.config -Dsystem.input-read-limits.lines=50000 -jar /opt/cromwell.jar run -t wdl -i /storage1/fs1/ccdg/Active/analysis/ref_grant/assembly_analysis_20200220/multiple_competitive_alignment/competitive_genotyping.inputs.json /storage1/fs1/ccdg/Active/analysis/ref_grant/assembly_analysis_20200220/multiple_competitive_alignment/competitive_genotyping.wdl

