bcftools annotate: possible to use ID to match variants or otherwise disambiguate variants with the same POS/ALT? #1461

cwhelan · 2021-04-08T15:32:12Z

If I understand the current behavior of bcftools annotate correctly, records in the input VCF are matched to records in the annotation file based on POS, REF, and ALT in cases where the annotation file is a VCF, or if it's a tab-delimited file and REF and ALT are specified in -c.

When dealing with VCFs representing structural variants, we sometimes have records that represent different variants but have the same position and alternate allele. This is because we use symbolic ALT alleles, which don't always fully specify the variant. For example, we may have detected a deletion with two different tools, producing calls that have the same start position but different end positions. Both records will have <DEL> as their ALT allele despite representing different variants. If we've prepared data with which we'd like to annotate each variant, this leaves us unable to do so with bcftools annotate under the current matching scheme. For example, if in the VCF we have these two records:

chr1	100000	VID1	A	<DUP>	.	PASS	END=200000
chr1	100000	VID2	A	<DUP>	.	PASS	END=300000

And we'd like to annotate VID1 and VID2 with different values, there doesn't seem to be a way to do so with the current matching rules of bcftools annotate; i.e., if we have the annotation file:

chr1	100000	A	<DUP>	1
chr1	100000	A	<DUP>	2

and try to annotate with bcftools annotate -a annotations.tsv.gz -c CHROM,POS,REF,ALT,INFO/VAL input_vcf.gz, we get the output vcf:

chr1	100000	VID1	A	<DUP>	.	PASS	END=200000;VAL=1
chr1	100000	VID2	A	<DUP>	.	PASS	END=300000;VAL=1

If we add ID to the annotation file and include it in the column list, the ID of each record gets overwritten by the ID of the first annotation row that matches by CHROM, POS, REF, and ALT.

I was wondering whether there is some way to accomplish our desired annotation with the current functionality of bcftools and, if not, whether it would be possible to add it as a new feature. I could see the latter being accomplished either by an option that would allow the user to specify ID as a column in the annotation file that should be used for matching records, or by adding a matching rule to -l that would match the nth duplicate record at a given position to the nth duplicate annotation value (although I imagine the latter option might get tricky to implement).
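
To make the second idea concrete, here is a rough Python sketch of "nth duplicate matches nth annotation". It is only a toy model of the proposed rule, not anything bcftools does; the function name `annotate_by_rank` and the tuple layouts are made up for illustration.

```python
from collections import defaultdict

# Hypothetical in-memory stand-ins for the VCF records (CHROM, POS, ID, REF, ALT)
# and the annotation rows (CHROM, POS, REF, ALT, VAL) from the example above.
records = [
    ("chr1", 100000, "VID1", "A", "<DUP>"),
    ("chr1", 100000, "VID2", "A", "<DUP>"),
]
annotations = [
    ("chr1", 100000, "A", "<DUP>", "1"),
    ("chr1", 100000, "A", "<DUP>", "2"),
]


def annotate_by_rank(records, annotations):
    """Pair the nth record sharing a key with the nth annotation row sharing that key."""
    # Collect annotation values per (CHROM, POS, REF, ALT) key, in file order.
    values_by_key = defaultdict(list)
    for chrom, pos, ref, alt, val in annotations:
        values_by_key[(chrom, pos, ref, alt)].append(val)

    seen = defaultdict(int)  # how many records with each key we have already handled
    out = []
    for chrom, pos, vid, ref, alt in records:
        key = (chrom, pos, ref, alt)
        n = seen[key]
        seen[key] += 1
        vals = values_by_key.get(key, [])
        # Records without an nth annotation are left unannotated (None).
        out.append((vid, vals[n] if n < len(vals) else None))
    return out


result = annotate_by_rank(records, annotations)
print(result)
```

The pairing depends entirely on both files listing the duplicates in the same order, which is part of why this rule looks fragile compared with matching on ID.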


pd3 · 2021-04-16T15:28:02Z

This is now possible by including ~ID instead of ID in --columns. This can be used e.g. as

bcftools annotate -a annots.tab.gz -c CHROM,POS,~ID,REF,ALT,VAL input.vcf

Please try it out and let me know if you encounter any problems.
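
For readers following along, the effect of adding ID to the match key can be illustrated with a small Python toy model. This is a sketch under assumptions, not bcftools' actual implementation: it assumes the annotation file gains an ID column in the position given by -c (CHROM, POS, ID, REF, ALT, VAL) and that the first row wins when keys collide, which is consistent with the VAL=1 output reported above. The `annotate` function and `match_id` flag are hypothetical names.

```python
# Input VCF records reduced to (CHROM, POS, ID, REF, ALT).
records = [
    ("chr1", 100000, "VID1", "A", "<DUP>"),
    ("chr1", 100000, "VID2", "A", "<DUP>"),
]

# Annotation rows laid out as CHROM, POS, ID, REF, ALT, VAL (assumed layout).
annotations = [
    ("chr1", 100000, "VID1", "A", "<DUP>", "1"),
    ("chr1", 100000, "VID2", "A", "<DUP>", "2"),
]


def annotate(records, annotations, match_id):
    """Look up VAL per record, optionally including ID in the match key."""
    index = {}
    for chrom, pos, vid, ref, alt, val in annotations:
        key = (chrom, pos, ref, alt) + ((vid,) if match_id else ())
        index.setdefault(key, val)  # first annotation row wins on a key collision
    out = []
    for chrom, pos, vid, ref, alt in records:
        key = (chrom, pos, ref, alt) + ((vid,) if match_id else ())
        out.append((vid, index.get(key)))
    return out


without_id = annotate(records, annotations, match_id=False)
with_id = annotate(records, annotations, match_id=True)
print(without_id)  # both records receive the first row's value
print(with_id)     # each record receives the value of the row with its own ID
```

With ID in the key, the two <DUP> records at chr1:100000 no longer collide, so VID1 and VID2 can carry different values.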

pd3 closed this as completed in 308c149 Apr 16, 2021
