Tabix indexes on END position of interchromosomal VCF entries #428

Closed
msto opened this issue Oct 19, 2016 · 1 comment

msto commented Oct 19, 2016

Hi,

I'm having some trouble indexing a VCF of structural variants when I specify an END position for interchromosomal BND sites. The indexing assumes that the END position is the end of a region on the primary chromosome. This results in breakends that don't lie in a query interval being returned, and the exclusion of breakends that do.

Here's an example VCF.

#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  sample1
22      16872532        var0    C       ]X:153010136]TC 999     PASS    CHR2=X;END=153010136;SVTYPE=BND GT:GL:GQ:DR:PE:RR:SR    0/1:975,0,999:925.0:38:3:41:29
22      16945961        var1    A       A[Y:13529835[   999     MaxDepth        CHR2=Y;END=13529835;SVTYPE=BND  GT:GL:GQ:DR:PE:RR:SR    0/1:999,0,999:999.0:69:1:45:51
22      16945961        var2    A       A[Y:13529835[   999     MaxDepth        CHR2=Y;SVTYPE=BND       GT:GL:GQ:DR:PE:RR:SR    0/1:999,0,999:999.0:69:1:45:51
22      16945961        var3    A       <DEL>   999     MaxDepth        CHR2=22;END=16946061;SVTYPE=DEL GT:GL:GQ:DR:PE:RR:SR    0/1:999,0,999:999.0:69:1:45:51
22      17010316        var4    C       [Y:28549365[AC  999     MaxDepth        CHR2=Y;END=28549365;SVTYPE=BND  GT:GL:GQ:DR:PE:RR:SR    0/1:999,0,999:999.0:46:0:36:35

I'd expect a tabix query on the region 22:16900000-17100000 to return variants var1, var2, var3, and var4. However, var1 is excluded despite sharing the same coordinate as var2 and var3, and var0 is included despite lying outside the query region.

$ tabix demo.vcf.gz 22:16900000-17100000
22      16872532        var0    C       ]X:153010136]TC 999     PASS    CHR2=X;END=153010136;SVTYPE=BND GT:GL:GQ:DR:PE:RR:SR    0/1:975,0,999:925.0:38:3:41:29
22      16945961        var2    A       A[Y:13529835[   999     MaxDepth        CHR2=Y;SVTYPE=BND       GT:GL:GQ:DR:PE:RR:SR    0/1:999,0,999:999.0:69:1:45:51
22      16945961        var3    A       <DEL>   999     MaxDepth        CHR2=22;END=16946061;SVTYPE=DEL GT:GL:GQ:DR:PE:RR:SR    0/1:999,0,999:999.0:69:1:45:51
22      17010316        var4    C       [Y:28549365[AC  999     MaxDepth        CHR2=Y;END=28549365;SVTYPE=BND  GT:GL:GQ:DR:PE:RR:SR    0/1:999,0,999:999.0:46:0:36:35

As far as I can tell, this is because tabix interprets the coordinates in the END INFO fields as end positions on chromosome 22, giving the intervals 22:16872532-153010136 and 22:16945961-13529835. The first spans the query region, so var0 is returned; the second is a null interval because its end precedes its start, so var1 is dropped.
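The hypothesis above can be checked with a small sketch. It is not tabix's actual code: it assumes END is taken at face value as an end coordinate on the record's own chromosome (CHR2 ignored), that records without END span the length of REF, and that a record is returned when its 0-based half-open interval overlaps the query. The helper names `parse_interval` and `overlaps` are made up for illustration, and the records are the first eight columns of the example above.

```python
# Hypothetical model of END-based interval overlap; not tabix's implementation.

def parse_interval(line):
    """Return (id, chrom, beg, end) as a 0-based half-open interval."""
    chrom, pos, vid, ref, _alt, _qual, _filt, info = line.split()[:8]
    beg = int(pos) - 1
    end = beg + len(ref)  # assumed default when INFO has no END key
    for field in info.split(";"):
        if field.startswith("END="):
            end = int(field[4:])  # assumption: used as-is, CHR2 is ignored
    return vid, chrom, beg, end


def overlaps(beg, end, qbeg, qend):
    """True if [beg, end) and [qbeg, qend) share at least one base."""
    return beg < qend and end > qbeg


RECORDS = [
    "22 16872532 var0 C ]X:153010136]TC 999 PASS CHR2=X;END=153010136;SVTYPE=BND",
    "22 16945961 var1 A A[Y:13529835[ 999 MaxDepth CHR2=Y;END=13529835;SVTYPE=BND",
    "22 16945961 var2 A A[Y:13529835[ 999 MaxDepth CHR2=Y;SVTYPE=BND",
    "22 16945961 var3 A <DEL> 999 MaxDepth CHR2=22;END=16946061;SVTYPE=DEL",
    "22 17010316 var4 C [Y:28549365[AC 999 MaxDepth CHR2=Y;END=28549365;SVTYPE=BND",
]

# Query 22:16900000-17100000 (1-based, inclusive) as 0-based half-open.
QBEG, QEND = 16900000 - 1, 17100000

hits = []
for record in RECORDS:
    vid, chrom, beg, end = parse_interval(record)
    if chrom == "22" and overlaps(beg, end, QBEG, QEND):
        hits.append(vid)

print(hits)
```

Under these assumptions the sketch selects var0, var2, var3, and var4, which matches the tabix output above.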

Is there any way to turn this behavior off and force tabix to index only on the POS column? I tried indexing with tabix -b2 -e2 $vcf instead of tabix -p vcf $vcf, but observed the same issue. (EDIT: the two tabix commands produce identical index files. Is this expected behavior?) I'm using tabix version 1.3-48-g1afaf0c.

Thanks!

pd3 (Member) commented Oct 24, 2016

Hi,
thank you for the test case. Yes, tabix is not ready for SVs going in the opposite direction. This is not easy to fix directly, because VCF records are sorted by the POS coordinate, and indexing by END for records where POS>END would break that order.
So the only possibility is to fix tabix so that file format auto-detection is turned off when -s, -b, or -e is given.
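
Until such a fix exists, one possible stopgap is to filter on POS alone without using the index. The sketch below is only an illustration: `query_by_pos` and `query_file` are hypothetical helpers, not part of tabix or htslib. It does a linear scan, so it will be much slower than an indexed lookup on large files. It relies on the fact that bgzip output is a series of concatenated gzip members, which Python's `gzip` module can read.

```python
import gzip


def query_by_pos(lines, chrom, start, end):
    """Yield VCF records on `chrom` whose POS lies in [start, end] (1-based, inclusive).

    Only the CHROM and POS columns are examined; INFO/END is never consulted.
    """
    for line in lines:
        if line.startswith("#"):
            continue  # skip meta-information and header lines
        fields = line.rstrip("\n").split("\t", 2)
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        if fields[0] == chrom and start <= int(fields[1]) <= end:
            yield line


def query_file(path, chrom, start, end):
    """Apply query_by_pos to a gzip- or bgzip-compressed VCF file."""
    with gzip.open(path, "rt") as handle:
        yield from query_by_pos(handle, chrom, start, end)
```

For the example in this issue, `for rec in query_file("demo.vcf.gz", "22", 16900000, 17100000): print(rec, end="")` should print var1 through var4 and omit var0, since only POS is compared.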
