STIX (Structural Variant Index) supports searching every discordant paired-end and split-read alignment from thousands of sample BAMs or CRAMs for the existence of an arbitrary SV. STIX reports a per-sample count of all concurring evidence. From these counts we can, for example, conclude that an SV with high-level evidence in many samples is common and an SV with no evidence is rare.
usage: stix <options>
options:
-i index directory
-s slop
-p PED file
-c Alt file column (default 1)
-d PED database file
-r right SV region
-l left SV region
-f VCF file
-a List of columns to aggregate over
-F Filter samples by PED field query
-j JSON output
-S Give only summary
The following example is based on four sample BAMs from the 1000 Genomes project:
samtools view -b s3://1000genomes/phase3/data/NA12812/alignment/NA12812.mapped.ILLUMINA.bwa.CEU.low_coverage.20130415.bam 13 14 > NA12812.13.14.bam
samtools view -b s3://1000genomes/phase3/data/HG00672/alignment/HG00672.mapped.ILLUMINA.bwa.CHS.low_coverage.20120522.bam 13 14 > HG00672.13.14.bam
samtools view -b s3://1000genomes/phase3/data/NA12878/alignment/NA12878.mapped.ILLUMINA.bwa.CEU.low_coverage.20121211.bam 13 14 > NA12878.13.14.bam
samtools view -b s3://1000genomes/phase3/data/HG00674/alignment/HG00674.mapped.ILLUMINA.bwa.CHS.low_coverage.20121211.bam 13 14 > HG00674.13.14.bam
To create an STIX index, use excord
to extract discordant paired-end reads and
split reads.
mkdir four_alt
wget -O excord https://github.com/brentp/excord/releases/download/v0.2.2/excord_linux64
chmod +x excord
wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/phase2_reference_assembly_sequence/hs37d5.fa.gz
for sample in NA12812.13.14 HG00672.13.14 NA12878.13.14 HG00674.13.14; do
samtools view -b $sample.bam \
| ./excord \
--discordantdistance 500 \
--fasta hs37d5.fa.gz \
/dev/stdin \
| LC_ALL=C sort --buffer-size 2G -k1,1 -k2,2n -k3,3n \
| bgzip -c > four_alt/$sample.bed.gz
done
Now index with GIGGLE.
giggle index -i "four_alt/*gz" -o four_alt_b -s -f
STIX requires a PED file that needs at least sample name and discordant alignment file name, but can support an arbitrary number of columns. For example:
Sample Sex Populationzv Super_Population Alt_File
NA12812 1 CEU EUR NA12812.13.14.bed.gz
HG00672 2 CHS EAS HG00672.13.14.bed.gz
NA12878 2 CEU EUR NA12878.13.14.bed.gz
HG00674 1 CHS EAS HG00674.13.14.bed.gz
From a PED file create a database/.
stix -i four_alt_b -p test/data/four.ped -d four.ped.db -c 5
STIX can query either a single SV or annotate a VCF file. All queries require a slop parameter that accounts for the variability in the normal fragment size. In most cases use the mean fragments size plus 3 times the fragment size standard deviation.
For individual queries, STIX takes the SV type and the left and right positions of the SV, where left is (by convention) the lower chromosomal position. Since many SVs do not have base pair resolution, STIX takes a genomic range for each side.
stix -i four_alt_db -d four.ped.db -s 500 -t DEL -l 14:68603030-68603035 -r 14:68603738-68603743
Total 0:1 0:0 7:8:11 0:2:1:1
Giggle_File_Id Sample Sex Population Super_Population Alt_File Pairend Split
0 HG00672 2 CHS EAS HG00672.13.14.bed.gz 8 0
1 HG00674 1 CHS EAS HG00674.13.14.bed.gz 7 0
2 NA12812 1 CEU EUR NA12812.13.14.bed.gz 7 0
3 NA12878 2 CEU EUR NA12878.13.14.bed.gz 11 0
For a VCF, STIX uses the INFO/SVTYPE
field for the SV type and the
POS
, INFO/END
, INFO/CIPOS
, and INFO/CIEND
fields for the quarry
positions. STIX annotates each variant with four fields:
STIX_ZERO
gives the number of samples that have no evidence for the SVSTIX_ONE
gives the number of samples that have one alignments supporting the SVSTIX_QUANTS
has three values that are the quantiles for the per sample count of alignments supporting the SVSTIX_QUANT_DEPTHS
has four values that give the number of samples with SV an evidence count that is- between 2 and the quantile one value number of alignments
- between quantile one to quantile two
- quantile two to quantile three
- more than quantile three
stix -i four_alt_db -d four.ped.db -s 500 -f 1kg.four.13.14.vcf.gz
git clone https://github.com/ryanlayer/giggle.git
cd giggle
make
cd ..
wget http://www.sqlite.org/2017/sqlite-amalgamation-3170000.zip
unzip sqlite-amalgamation-3170000.zip
git clone https://github.com/ryanlayer/stix.git
cd stix
make
STIX_ZERO STIX_ONE STIX_QUANTS STIX_QUANT_DEPTHS
0 0 0,4,11 0,0,1,1
1KG low-coverge SV callset (Sudmant et al., Nature 2015) (VCF, Paper)
#CHROM POS REF ALT INFO NA12878 NA12890 NA12889
19 12694867 G <CN0> CIEND=0,1;CIPOS=0,1;END=12698924 1|1 1|1 1|1
GT GQ SQ GL DP RO AO QR QA RS AS ASC RP AP AB
NA12878 1/1 13 796.49 -82,-3,-2 40 9 30 9 30 0 0 0 9 30 0.77
NA12890 1/1 10 609.94 -63,-3,-2 31 7 23 7 23 0 0 0 7 23 0.77
NA12889 0/1 139 794.47 -80,-1,-15 68 33 34 32 33 20 0 0 12 33 0.51
#CHROM POS REF ALT INFO NA12878 NA12890 NA12889
19 12694907 N <DEL> CIPOS=-10,69;CIEND=-59,4;END=12698928 1/1 1/1 1/1
High coverage | Low coverage |
---|---|
STIX_ZERO STIX_ONE STIX_QUANTS STIX_QUANT_DEPTHS
3 0 0,0,0 0,0,0,0
1KG low-coverge SV callset (Sudmant et al., Nature 2015) (VCF, Paper)
#CHROM POS REF ALT INFO NA12878 NA12890 NA12889
5 1022803 C <CN0> CIEND=-500,1000;CIPOS=-1000,500;END=1025877 1|1 1|1 1|1
GT GQ SQ GL DP RO AO QR QA RS AS ASC RP AP AB
NA12878 0/0 200 0.00 -0,-18,-61 62 62 0 61 0 27 0 0 34 0 0
NA12890 0/0 200 0.00 -0,-23,-78 78 78 0 78 0 37 0 0 41 0 0
NA12889 0/0 200 0.00 -0,-33,-109 110 109 0 109 0 55 0 0 54 0 0
N/A
High coverage | Low coverage |
---|---|
STIX_ZERO STIX_ONE STIX_QUANTS STIX_QUANT_DEPTHS
0 0 4,8,10 0,1,1,1
1KG low-coverge SV callset (Sudmant et al., Nature 2015) (VCF, Paper)
#CHROM POS REF ALT INFO NA12878 NA12890 NA12889
4 113985874 . <DEL> END=113986369 0|0 0|0 0|0
GT GQ SQ GL DP RO AO QR QA RS AS ASC RP AP AB
NA12878 1/1 24 1484.47 -152,-6,-3 74 17 56 17 56 0 0 0 17 56 0.77
NA12890 0/1 105 446.67 -46,-1,-11 42 22 19 22 19 0 0 0 22 19 0.46
NA12889 0/1 21 1259.40 -129,-3,-6 71 21 49 21 49 0 0 0 21 49 0.7
#CHROM POS REF ALT INFO NA12878 NA12890 NA12889
4 113985951 . <DEL> CIPOS=-12,2;CIEND=-51,12;END=113986325 1/1 0/1 1/1
High coverage | Low coverage |
---|---|
1KG low-coverge SV callset (Sudmant et al., Nature 2015) (VCF, Paper)
#CHROM POS INFO NA12878 NA12890 NA12889
12 12544792 END=12546607;CIEND=-21,21;CIPOS=-21,21;AC=4129;AF=0.824481 1|1 1|1 1|1
GT SU PE SR GQ SQ GL DP RO AO QR QA RS AS ASC RP AP AB
NA12878 0/1 58 58 0 114 1478.17 -150,-2,-14 99 39 60 38 59 20 0 2 18 57 0.61
NA12890 1/1 54 54 0 25 1356.74 -139,-5,-3 67 15 51 15 51 0 0 0 15 51 0.77
NA12889 0/1 46 46 0 200 1073.26 -108,-1,-22 94 47 46 47 45 26 0 1 21 44 0.49
#CHROM POS INFO NA12878 NA12890 NA12889
12 12544868 END=12546613;CIPOS=-2,17;CIEND=-15,19; 0/1 1/1 0/1
High coverage | Low coverage |
---|---|
1KG low-coverge SV callset (Sudmant et al., Nature 2015) (VCF, Paper)
#CHROM POS INFO NA12878 NA12890 NA12889
12 47290470 END=47309756;CIEND=-55,55;CIPOS=-55,55;AC=248;AF=0.0495208 0|0 0|0 1|0
GT SU PE SR GQ SQ GL DP RO AO QR QA RS AS ASC RP AP AB
NA12878 0/0 0 0 0 200 0.00 -0,-39,-130 131 131 0 130 0 64 0 0 66 0 0
NA12890 0/0 0 0 0 200 0.00 -0,-35,-115 116 116 0 115 0 56 0 0 59 0 0
NA12889 0/1 79 56 23 146 578.17 -65,-7,-59 120 87 32 86 31 38 0 9 48 22 0.26
#CHROM POS INFO NA12878 NA12890 NA12889
12 47290448 END=47309758;CIPOS=-3,2;CIEND=-10,5; 0/0 0/0 0/1
High coverage | Low coverage |
---|---|
1KG low-coverge SV callset (Sudmant et al., Nature 2015) (VCF, Paper)
N/A
GT SU PE SR GQ SQ GL DP RO AO QR QA RS AS ASC RP AP AB
NA12878 0/1 0 0 0 4 53.14 -6,-0,-1 4 1 2 1 2 1 0 2 0 0 0.67
NA12890 0/0 0 0 0 200 0 -0,-28,-93 94 94 0 93 0 42 0 0 51 0 0
NA12889 0/1 68 44 24 143 383.23 -48,-9,-65 115 89 25 88 24 42 0 2 46 22 0.21
#CHROM POS INFO NA12878 NA12890 NA12889
2 89161083 END=89185670;CIPOS=0,0;CIEND=-9,0 0/1 0/0 0/1
High coverage | Low coverage |
---|---|