bcftools annotate can output an INFO field with unquoted semicolons #2202

Closed
jkmatila opened this issue May 31, 2024 · 1 comment

jkmatila commented May 31, 2024

bcftools annotate can output an INFO field value containing unquoted semicolons (;). When the file is parsed, the part after the semicolon is interpreted as a separate INFO field. If that part also contains a comma, bcftools view cannot read the resulting file and fails with a parse error.

Steps to reproduce:

A minimal VCF file to annotate:

$ cat repro.vcf
##fileformat=VCFv4.3
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##contig=<ID=chr20>
##reference=hg38
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	repro
chr20	33791101	.	GC	G	.	.	.	GT	0/1

An annotations file whose annotation value contains a semicolon:

$ cat annots.txt
chr20	33791101	GC	G	ENST00000342427.6:c.2129delC,ENST00000342427.6:p.K711Rfs*47;ENST00000375200.6:c.2150delC,ENST00000375200.6:p.K718Rfs*47

A header line to use for the new annotation:

$ cat header.txt
##INFO=<ID=FOO,Number=1,Type=String,Description="Yet another header line">

Annotating the VCF file:

$ bgzip annots.txt
$ tabix -s 1 -b 2 -e 2 annots.txt.gz
$ bcftools annotate -a annots.txt.gz -h header.txt -c CHROM,POS,REF,ALT,FOO repro.vcf > out.vcf

The resulting VCF file contains the INFO field separator ; unquoted inside the FOO value:

$ cat out.vcf
##fileformat=VCFv4.3
##FILTER=<ID=PASS,Description="All filters passed">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##contig=<ID=chr20>
##reference=hg38
##INFO=<ID=FOO,Number=1,Type=String,Description="Yet another header line">
##bcftools_annotateVersion=1.20+htslib-1.20
##bcftools_annotateCommand=annotate -a annots.txt.gz -h header.txt -c CHROM,POS,REF,ALT,FOO repro.vcf; Date=Fri May 31 10:13:26 2024
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	repro
chr20	33791101	.	GC	G	.	.	FOO=ENST00000342427.6:c.2129delC,ENST00000342427.6:p.K711Rfs*47;ENST00000375200.6:c.2150delC,ENST00000375200.6:p.K718Rfs*47	GT	0/1
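
To make the effect concrete, here is a minimal Python sketch of how the INFO column above splits under the VCF rules (entries separated by ;, key and value separated by =). The helper name split_info is hypothetical, and this is a simplified model for illustration, not the htslib parser.

```python
def split_info(info_column):
    """Split a VCF INFO column into a dict of key -> value.

    Entries without '=' are treated as flags and stored as True.
    """
    fields = {}
    if info_column == ".":
        return fields
    for entry in info_column.split(";"):
        key, sep, value = entry.partition("=")
        fields[key] = value if sep else True
    return fields


# INFO column copied from the out.vcf record above.
info = (
    "FOO=ENST00000342427.6:c.2129delC,ENST00000342427.6:p.K711Rfs*47;"
    "ENST00000375200.6:c.2150delC,ENST00000375200.6:p.K718Rfs*47"
)
parsed = split_info(info)
for key, value in parsed.items():
    print(repr(key), "->", repr(value))
```

The dict ends up with two keys: FOO, holding only the first half of the annotation, and a second key made of the text after the semicolon, which looks like a flag with no value.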

bcftools view rejects this file: it parses the part after the semicolon as another INFO field and tries to create a dummy header line for it, which fails because of the embedded comma:

$ bcftools view out.vcf
[W::vcf_parse_info] INFO 'ENST00000375200.6:c.2150delC,ENST00000375200.6:p.K718Rfs*47' is not defined in the header, assuming Type=String
[E::bcf_hdr_parse_line] Could not parse the header line: "##INFO=<ID=ENST00000375200.6:c.2150delC,ENST00000375200.6:p.K718Rfs*47,Number=1,Type=String,Description=\"Dummy\">"
[E::vcf_parse_info] Could not add dummy header for INFO 'ENST00000375200.6:c.2150delC,ENST00000375200.6:p.K718Rfs*47' at chr20:33791101
Error: VCF parse error
##fileformat=VCFv4.3
##FILTER=<ID=PASS,Description="All filters passed">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##contig=<ID=chr20>
##reference=hg38
##INFO=<ID=FOO,Number=1,Type=String,Description="Yet another header line">
##bcftools_annotateVersion=1.20+htslib-1.20
##bcftools_annotateCommand=annotate -a annots.txt.gz -h header.txt -c CHROM,POS,REF,ALT,FOO repro.vcf; Date=Fri May 31 10:13:26 2024
##bcftools_viewVersion=1.20+htslib-1.20
##bcftools_viewCommand=view out.vcf; Date=Fri May 31 10:15:56 2024
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	repro
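
The following Python sketch illustrates why a comma in the ID makes the dummy header line malformed. It assumes only the structured header layout (comma-separated key=value pairs inside <...>) and the dummy line format shown in the error message; the function names are hypothetical and the logic is a simplified model, not htslib's implementation.

```python
def split_outside_quotes(text, sep=","):
    """Split text on sep, ignoring separators inside double quotes."""
    parts, current, in_quotes = [], [], False
    for ch in text:
        if ch == '"':
            in_quotes = not in_quotes
        if ch == sep and not in_quotes:
            parts.append("".join(current))
            current = []
        else:
            current.append(ch)
    parts.append("".join(current))
    return parts


def parse_info_header(line):
    """Parse '##INFO=<key=value,...>' into a dict, raising on malformed entries."""
    prefix, suffix = "##INFO=<", ">"
    if not (line.startswith(prefix) and line.endswith(suffix)):
        raise ValueError("not an INFO header line")
    pairs = {}
    for part in split_outside_quotes(line[len(prefix):-len(suffix)]):
        key, sep, value = part.partition("=")
        if not sep:
            raise ValueError(f"entry without '=': {part!r}")
        pairs[key] = value
    return pairs


def dummy_header(info_id):
    """Build a dummy header line in the format seen in the error message."""
    return f'##INFO=<ID={info_id},Number=1,Type=String,Description="Dummy">'


good = parse_info_header(dummy_header("FOO"))
try:
    parse_info_header(
        dummy_header("ENST00000375200.6:c.2150delC,ENST00000375200.6:p.K718Rfs*47")
    )
    error = None
except ValueError as exc:
    error = str(exc)
print(good)
print(error)
```

With an ordinary ID such as FOO the line parses cleanly. With the spurious ID, the comma splits it in two and the second half is not a key=value pair.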

Additional information

VCF v4.3 spec, Section 1.2 says:

Some characters have a special meaning when they appear (such as field delimiters ‘;’ in INFO or ‘:’ FORMAT fields), and for any other meaning they must be represented with the capitalized percent encoding; [...]
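
As an illustration of that rule, here is a minimal Python sketch of the percent encoding, using the character table from the same section of the spec. The function names and the default set of characters treated as special in an INFO value are my assumptions; in particular, whether commas should be encoded depends on whether the value is meant to be a list. This is not the code bcftools uses.

```python
import re

# Capitalized percent encodings listed in the VCF v4.3 spec, Section 1.2.
VCF_PERCENT_ENCODING = {
    "%": "%25",
    ":": "%3A",
    ";": "%3B",
    "=": "%3D",
    ",": "%2C",
    "\r": "%0D",
    "\n": "%0A",
    "\t": "%09",
}
_DECODING = {code: char for char, code in VCF_PERCENT_ENCODING.items()}
_DECODE_RE = re.compile("|".join(re.escape(code) for code in _DECODING))


def encode_vcf_value(value, special=";=%,\r\n\t"):
    """Percent-encode the characters in `special` (assumed default: INFO values)."""
    return "".join(
        VCF_PERCENT_ENCODING[ch] if ch in special else ch for ch in value
    )


def decode_vcf_value(value):
    """Reverse the encoding in a single pass so '%253B' becomes '%3B', not ';'."""
    return _DECODE_RE.sub(lambda m: _DECODING[m.group(0)], value)


# Annotation value from annots.txt in the steps to reproduce.
raw = (
    "ENST00000342427.6:c.2129delC,ENST00000342427.6:p.K711Rfs*47;"
    "ENST00000375200.6:c.2150delC,ENST00000375200.6:p.K718Rfs*47"
)
encoded = encode_vcf_value(raw)
print(encoded)
print(decode_vcf_value(encoded) == raw)
```

The percent sign itself is encoded as well, so the transformation stays reversible.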

bcftools version

$ bcftools version
bcftools 1.20
Using htslib 1.20
Copyright (C) 2024 Genome Research Ltd.
License Expat: The MIT/Expat license
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.

Files used in the steps to reproduce

repro.zip

pd3 closed this as completed in f313599 on Jun 3, 2024

pd3 commented Jun 3, 2024

The program now makes sure characters with special meaning are encoded.
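
Files written before this fix may still contain the problem. A minimal Python sketch for flagging such records is shown below: it reports every INFO key in a record that has no matching ##INFO header line. The function name is hypothetical, and the sketch assumes uncompressed VCF text with ID as the first attribute of each ##INFO line.

```python
def undeclared_info_keys(vcf_lines):
    """Return (chrom, pos, key) for INFO keys missing from the header."""
    declared = set()
    problems = []
    id_prefix = "##INFO=<ID="
    for line in vcf_lines:
        line = line.rstrip("\n")
        if line.startswith(id_prefix):
            # Assumes ID is the first attribute of the header line.
            declared.add(line[len(id_prefix):].split(",", 1)[0])
        elif line and not line.startswith("#"):
            cols = line.split("\t")
            if len(cols) < 8 or cols[7] == ".":
                continue
            for entry in cols[7].split(";"):
                key = entry.split("=", 1)[0]
                if key not in declared:
                    problems.append((cols[0], int(cols[1]), key))
    return problems


# Header and record taken from out.vcf in the report above.
sample = [
    "##fileformat=VCFv4.3\n",
    '##INFO=<ID=FOO,Number=1,Type=String,Description="Yet another header line">\n',
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\trepro\n",
    "chr20\t33791101\t.\tGC\tG\t.\t.\t"
    "FOO=ENST00000342427.6:c.2129delC,ENST00000342427.6:p.K711Rfs*47;"
    "ENST00000375200.6:c.2150delC,ENST00000375200.6:p.K718Rfs*47\tGT\t0/1\n",
]
problems = undeclared_info_keys(sample)
for chrom, pos, key in problems:
    print(f"{chrom}:{pos} undeclared INFO key {key!r}")
```

On a real file, the lines could come from open("out.vcf"). A hit does not prove the file came from this bug, since any undeclared key is reported, but the record from this report is flagged.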
