Skip to content

Latest commit

 

History

History

svpop

Folders and files

NameName
Last commit message
Last commit date

parent directory

..
 
 
 
 

SVPOP

Data

The VCF file with the SV catalog (sv-pop-explicit.vcf.gz) was prepared with:

wget -nc http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/hgsv_sv_discovery/working/20181025_EEE_SV-Pop_1/VariantCalls_EEE_SV-Pop_1/EEE_SV-Pop_1.ALL.sites.20181204.bed.gz
wget -nc http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/hgsv_sv_discovery/working/20181025_EEE_SV-Pop_1/VariantCalls_EEE_SV-Pop_1/EEE_SV-Pop_1.ALL.sites.20181204.vcf.gz
wget -nc http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/hgsv_sv_discovery/working/20181025_EEE_SV-Pop_1/VariantCalls_EEE_SV-Pop_1/EEE_SV-Pop_1.ALL.sites.20181204.vcf.gz.tbi

# fixes up some basic info tags like AC and AF
vcfkeepinfo EEE_SV-Pop_1.ALL.sites.20181204.vcf.gz SVTYPE | vcffixup - | bgzip > EEE_SV-Pop_1.ALL.sites.20181204.fix.vcf.gz
./vcf-add-bed-seqs.py --inv leave EEE_SV-Pop_1.ALL.sites.20181204.fix.vcf.gz EEE_SV-Pop_1.ALL.sites.20181204.bed.gz | bgzip > sv-pop-explicit.vcf.gz
tabix -f -p vcf sv-pop-explicit.vcf.gz

Reads

  • HG00514: ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR903/ERR903030/ERR903030_1.fastq.gz ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR903/ERR903030/ERR903030_2.fastq.gz
  • HG00733: ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR894/ERR894724/ERR894724_1.fastq.gz ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR894/ERR894724/ERR894724_2.fastq.gz
  • NA19240: ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR894/ERR894724/ERR894724_1.fastq.gz ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR894/ERR894724/ERR894724_2.fastq.gz

toil-vg

Using the helper scripts from ../toil-scripts.

# Construct the graph and index for svpop (15 sv samples) including inversions
./construct.sh -p -i -c ${CLUSTER} ${JOBSTORE} ${OUTSTORE}/SVPOP

# Mapping and calling

./map.sh -c ${CLUSTER} ${JOBSTORE} ${OUTSTORE}/SVPOP/map-HG00514 s3://${OUTSTORE}/SVPOP/SVPOP HG00514 ${FQBASE}/HG00514/ERR903030_1.fastq.gz ${FQBASE}/HG00514/ERR903030_2.fastq.gz 
./map.sh -c ${CLUSTER} ${JOBSTORE} ${OUTSTORE}/SVPOP/map-HG00733 s3://${OUTSTORE}/SVPOP/SVPOP HG00733 ${FQBASE}/HG00733/ERR895347_1.fastq.gz ${FQBASE}/HG00733/ERR895347_2.fastq.gz
./map.sh -c ${CLUSTER} ${JOBSTORE} ${OUTSTORE}/SVPOP/map-NA19240 s3://${OUTSTORE}/SVPOP/SVPOP NA19240 ${FQBASE}/NA19240/ERR894724_1.fastq.gz ${FQBASE}/NA19240/ERR894724_2.fastq.gz

./call.sh -c ${CLUSTER} -v s3://${OUTSTORE}/SVPOP/sv-pop-explicit.vcf.gz ${JOBSTORE} ${OUTSTORE}/SVPOP/call-HG00514 s3://${OUTSTORE}/SVPOP/SVPOP.xg HG00514 s3://${OUTSTORE}/SVPOP/map-HG00514/HG00514_chr
./call.sh -c ${CLUSTER} -v s3://${OUTSTORE}/SVPOP/sv-pop-explicit.vcf.gz ${JOBSTORE} ${OUTSTORE}/SVPOP/call-HG00733 s3://${OUTSTORE}/SVPOP/SVPOP.xg HG00733 s3://${OUTSTORE}/SVPOP/map-HG00733/HG00733_chr
./call.sh -c ${CLUSTER} -v s3://${OUTSTORE}/SVPOP/sv-pop-explicit.vcf.gz ${JOBSTORE} ${OUTSTORE}/SVPOP/call-NA19240 s3://${OUTSTORE}/SVPOP/SVPOP.xg NA19240 s3://${OUTSTORE}/SVPOP/map-NA19240/NA19240_chr

# vg call no longer needs toil-vg
# for the timing benchmarks in the paper, it was run directly on the whole genome output (GAM created by catting chromosome gams together)
# vg pack -x SVPOP.xg -Q 5 -t 10 -g HG00514.gam -o SVPOP.HG00514.pack
# vg call SVPOP.xg -k SVPOP.HG00514.pack -v SVPOP.haps.vcf.gz -r SVPOP.snarls -t 10 | bgzip > HG00514.vg.vcf.gz

Paragraph

Padding sequence was added when needed to avoid errors:

zcat sv-pop-explicit.vcf.gz | python ../misc-scripts/addMissingPaddingHg38.py | bgzip -c > sv-pop-explicit-padded.vcf.gz

Each sample was genotyped individually. For example for HG00514:

echo -e "id\tpath\tdepth\tread length\nHG00514\tHG00514.bam\t30\t150" > samples_for_paragraph_HG00514.txt
multigrmpy.py -m samples_for_paragraph_HG00514.txt -i sv-pop-explicit-padded.vcf.gz -r hg38.fa -t 10 -o paragraph_out_HG00514

The output genotypes are in paragraph_out_HG00514/genotypes.vcf.gz.

SMRT-SV v2

SMRT-SV was obtained from github using commit 43d28cbd4e76ed2a4e820ded11a80592675db6c9 (the then-current master branch). It was then run as described in its README, on a single server

${SMRTSV_DIR}/smrtsv --jobs 30 genotype genotyper.json variants.vcf.gz

Where genotyper.json contained (the VCF and BAM were downloaded from the FTP site listed above (ftp.1000genomes.ebi.ac.uk))

{
  "sample_manifest": "svpop-samples.tab",
  "sv_reference": "hg38.fa",
  "sv_calls": "EEE_SV-Pop_1.ALL.sites.20181204.vcf.gz",
  "sv_contigs": "EEE_SV-Pop_1.ALL.sites.20181204.bam",
  "model": "30x-4",
  "min_call_depth": 8
}

and hsvc-samples.tab contained:

SAMPLE	SEX	DATA
HG00514	F	hgsvc-bams/ERR903030_bwa_mem_sort_rg_md.bam
HG00733	F	hgsvc-bams/ERR895347_bwa_mem_sort_rg_md.bam
NA19240	F	hgsvc-bams/ERR894724_bwa_mem_sort_rg_md.bam

The BAMs were made as described here

The VCFs output by this version of SMRT-SV v2 are invalid due to a bug which outputs the wrong values in the REF column. This was corrected by re-extracting the REF values from the FASTA using this script on variants.vcf.gz. At the same time, the symbolic alleles were converted to explict alleles.

./make-explicit.py variants.vcf.gz --fasta hg38.fa.gz  | vcfkeepinfo - NA | vcffixup - | bgzip > svpop-smrtsv-explicit.vcf.gz