bcftools annotate: possible to use ID to match variants or otherwise disambiguate variants with the same POS/ALT? #1461

Closed
cwhelan opened this issue Apr 8, 2021 · 1 comment

cwhelan commented Apr 8, 2021

If I understand the current behavior of bcftools annotate correctly, records in the input VCF are matched to records in the annotation file based on POS, REF, and ALT in cases where the annotation file is a VCF, or if it's a tab-delimited file and REF and ALT are specified in -c.

When dealing with VCFs representing structural variants, we sometimes have records that represent different variants but have the same position and alternate allele. This is because we use symbolic ALT alleles, which don't always fully specify the variant. For example, we may have detected a deletion with two different tools, and the two calls have the same start position but different end positions. Both records will have <DEL> as their ALT allele despite representing different variants. If we've prepared data with which we'd like to annotate each variant, we are unable to do so with bcftools annotate under the current matching scheme. For example, if in the VCF we have these two records:

chr1	100000	VID1	A	<DUP>	.	PASS	END=200000
chr1	100000	VID2	A	<DUP>	.	PASS	END=300000

And we'd like to annotate VID1 and VID2 with different values, there doesn't seem to be a way to do so with the current matching rules of bcftools annotate; i.e., if we have the annotation file:

chr1	100000	A	<DUP>	1
chr1	100000	A	<DUP>	2

and try to annotate with bcftools annotate -a annotations.tsv.gz -c CHROM,POS,REF,ALT,INFO/VAL input_vcf.gz, we get the output VCF:

chr1	100000	VID1	A	<DUP>	.	PASS	END=200000;VAL=1
chr1	100000	VID2	A	<DUP>	.	PASS	END=300000;VAL=1
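
To make the collision concrete, here is a minimal Python sketch of matching on CHROM, POS, REF, and ALT only. It is a toy model that assumes the first annotation row with a given key wins, which is consistent with the output above; it is not bcftools's actual implementation, and the names `Record` and `annotate_first_match` are made up for illustration.

```python
# Toy model only: NOT bcftools's actual implementation.
# Assumption: the first annotation row sharing a record's key is the one used.
from typing import NamedTuple


class Record(NamedTuple):
    chrom: str
    pos: int
    vid: str
    ref: str
    alt: str


records = [
    Record("chr1", 100000, "VID1", "A", "<DUP>"),
    Record("chr1", 100000, "VID2", "A", "<DUP>"),
]

# Annotation rows in the column order CHROM,POS,REF,ALT,VAL
annotations = [
    ("chr1", 100000, "A", "<DUP>", 1),
    ("chr1", 100000, "A", "<DUP>", 2),
]


def annotate_first_match(records, annotations):
    """Give each record the VAL of the first annotation row sharing its key."""
    first_by_key = {}
    for chrom, pos, ref, alt, val in annotations:
        # setdefault keeps the first value seen for a key and ignores later ones
        first_by_key.setdefault((chrom, pos, ref, alt), val)
    return {
        r.vid: first_by_key.get((r.chrom, r.pos, r.ref, r.alt))
        for r in records
    }


result = annotate_first_match(records, annotations)
print(result)  # {'VID1': 1, 'VID2': 1}
```

Because both records and both annotation rows share the same key, the second annotation value is never reachable in this model.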

If we add ID to the annotation file and include it in the column list, the record's ID is overwritten with the ID of the first annotation row that matches on CHROM, POS, REF, and ALT.

I was wondering whether there is some way to accomplish our desired annotation with the current functionality of bcftools and, if not, whether it would be possible to add it as a new feature. I could see the latter being accomplished either by an option that would allow the user to specify ID as a column in the annotation file to be used for matching records, or by adding a matching rule to -l that would match the nth duplicate record at a given position to the nth duplicate annotation value (although I imagine the latter option might be tricky to implement).

pd3 closed this as completed in 308c149 on Apr 16, 2021

pd3 commented Apr 16, 2021

This is now possible by including ~ID instead of ID in --columns. This can be used e.g. as

bcftools annotate -a annots.tab.gz  -c CHROM,POS,~ID,REF,ALT,VAL input.vcf

Please try it out and let me know if you encounter any problems.
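
As a rough illustration of what matching on ID as well buys you, the following Python sketch adds the ID to the match key. It is a toy model, not bcftools internals. It assumes the annotation file gains an ID column in the third position, mirroring the `CHROM,POS,~ID,REF,ALT,VAL` column order in the command above. The row contents and the helper name `annotate_by_id` are hypothetical.

```python
# Toy model only: mimics the effect of matching on ID as well, not bcftools internals.
# VCF records reduced to (CHROM, POS, ID, REF, ALT)
vcf_records = [
    ("chr1", 100000, "VID1", "A", "<DUP>"),
    ("chr1", 100000, "VID2", "A", "<DUP>"),
]

# Hypothetical annotation rows in the column order CHROM,POS,ID,REF,ALT,VAL
annotation_rows = [
    ("chr1", 100000, "VID1", "A", "<DUP>", 1),
    ("chr1", 100000, "VID2", "A", "<DUP>", 2),
]


def annotate_by_id(vcf_records, annotation_rows):
    """Match on (CHROM, POS, ID, REF, ALT) so same-position records stay distinct."""
    val_by_key = {row[:5]: row[5] for row in annotation_rows}
    return {rec[2]: val_by_key.get(rec) for rec in vcf_records}


matched = annotate_by_id(vcf_records, annotation_rows)
print(matched)  # {'VID1': 1, 'VID2': 2}

# The same rows rendered as tab-delimited text, one line per annotation
tsv_text = "\n".join("\t".join(str(field) for field in row) for row in annotation_rows)
print(tsv_text)
```

With the ID in the key, the two rows no longer collide, so each variant receives its own value.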
