Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added case of bracketed list of values to bcf_hdr_parse_line. #1240

Merged
merged 2 commits into from
Jun 9, 2021

Conversation

AlbertoCasasOrtiz
Copy link
Contributor

I was using the following lines in a VCF file (extracted directly from the VCFv4.3.pdf file):

##META=<ID=Assay,Type=String,Number=.,Values=[WholeGenome, Exome]>
##META=<ID=Disease,Type=String,Number=.,Values=[None, Cancer]>
##META=<ID=Ethnicity,Type=String,Number=.,Values=[AFR, CEU, ASN, MEX]>
##META=<ID=Tissue,Type=String,Number=.,Values=[Blood, Breast, Colon, Lung, ?]>

The full VCF file was parsed correctly, but those four lines were giving the following errors:

[E::bcf_hdr_parse_line] Could not parse the header line: "##META=<ID=Assay,Type=String,Number=.,Values=[WholeGenome,Exome]>"
[E::bcf_hdr_parse_line] Could not parse the header line: "##META=<ID=Disease,Type=String,Number=.,Values=[None, Cancer]>"
[E::bcf_hdr_parse_line] Could not parse the header line: "##META=<ID=Ethnicity,Type=String,Number=.,Values=[AFR, CEU, ASN, MEX]>"
[E::bcf_hdr_parse_line] Could not parse the header line: "##META=<ID=Tissue,Type=String,Number=.,Values=[Blood, Breast, Colon, Lung, ?]>"

I found out that deleting the Values from each ##META entry (e.g., deleting Values=[WholeGenome, Exome] in the first entry), it worked. Also, if I substituted the [ symbol and ] symbol with " it worked.

After looking at the file vcf.c and debugging the behaviour of the function bcf_hdr_parse_line (original), I've found out that the case of a the value in a key-value pair starting and ending with brackets ([ and ]) was not contemplated, so it could not correctly parse the aforementioned lines.

So, I have added that case to the function bcf_hdr_parse_line (mine). After testing again with the same vcf file (and a couple more I have from 1000 Genome Project), it works perfectly now.

@valeriuo
Copy link
Contributor

valeriuo commented Feb 23, 2021

You can probably reuse most of the code from quoted without duplicating it.
Something like:

@@ -426,11 +426,23 @@ bcf_hrec_t *bcf_hdr_parse_line(const bcf_hdr_t *h, const char *line, int *len)
         if (bcf_hrec_add_key(hrec, p, q-p-m) < 0) goto fail;
         p = ++q;
         while ( *q && *q==' ' ) { p++; q++; }
-        int quoted = *p=='"' ? 1 : 0;
+        int quoted = 0;
+        char ending;
+        switch (*p) {
+        case '"':
+            quoted = 1;
+            ending = '"';
+            break;
+        case '[':
+            quoted = 1;
+            ending = ']';
+            break;
+        }
+
         if ( quoted ) p++, q++;
         while ( *q && *q != '\n' )
         {
-            if ( quoted ) { if ( *q=='"' && !is_escaped(p,q) ) break; }
+            if ( quoted ) { if ( *q==ending && !is_escaped(p,q) ) break; }
             else
             {
                 if ( *q=='<' ) nopen++;
@@ -444,7 +456,7 @@ bcf_hrec_t *bcf_hdr_parse_line(const bcf_hdr_t *h, const char *line, int *len)
         while ( r > p && r[-1] == ' ' ) r--;
         if (bcf_hrec_set_val(hrec, hrec->nkeys-1, p, r-p, quoted) < 0)
             goto fail;
-        if ( quoted && *q=='"' ) q++;
+        if ( quoted && *q==ending ) q++;
         if ( *q=='>' ) { nopen--; q++; }
     }

@AlbertoCasasOrtiz
Copy link
Contributor Author

@valeriuo Thank you for your suggestions, you are right, it can be simplified. I have commited it with your changes, tested it, and its working.

@daviesrob
Copy link
Member

Sorry for taking a while to get back to this. It certainly fixes the parser, but there's a problem with round-tripping the data as the square brackets get converted into quotes, turning the list into a normal string. For example, with this small test file:

##fileformat=VCFv4.3
##FILTER=<ID=PASS,Description="All filters passed">
##INFO=<ID=AC,Number=A,Type=Integer,Description="Allele count in genotypes">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##contig=<ID=1,assembly=b37,length=249250621>
##META=<ID=Assay,Type=String,Number=.,Values=[WholeGenome, Exome]>
##META=<ID=Disease,Type=String,Number=.,Values=[None, Cancer]>
##META=<ID=Ethnicity,Type=String,Number=.,Values=[AFR, CEU, ASN, MEX]>
##META=<ID=Tissue,Type=String,Number=.,Values=[Blood, Breast, Colon, Lung, ?]>
##SAMPLE=<ID=S1,Assay=WholeGenome,Ethnicity=AFR,Disease=None,Description="Patient germline genome from unaffected",DOI=url>
##SAMPLE=<ID=S2,Assay=Exome,Ethnicity=CEU,Disease=Cancer,Tissue=Breast,Description="European patient exome from breast cancer">
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	S1	S2
1	3000150	.	C	T	59.2	PASS	AC=2	GT	0/1	1/0

I get:

$ ./htsfile -c /tmp/test.vcf
##fileformat=VCFv4.3
##FILTER=<ID=PASS,Description="All filters passed">
##INFO=<ID=AC,Number=A,Type=Integer,Description="Allele count in genotypes">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##contig=<ID=1,assembly=b37,length=249250621>
##META=<ID=Assay,Type=String,Number=.,Values="WholeGenome, Exome">
##META=<ID=Disease,Type=String,Number=.,Values="None, Cancer">
##META=<ID=Ethnicity,Type=String,Number=.,Values="AFR, CEU, ASN, MEX">
##META=<ID=Tissue,Type=String,Number=.,Values="Blood, Breast, Colon, Lung, ?">
##SAMPLE=<ID=S1,Assay=WholeGenome,Ethnicity=AFR,Disease=None,Description="Patient germline genome from unaffected",DOI=url>
##SAMPLE=<ID=S2,Assay=Exome,Ethnicity=CEU,Disease=Cancer,Tissue=Breast,Description="European patient exome from breast cancer">
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	S1	S2
1	3000150	.	C	T	59.2	PASS	AC=2	GT	0/1	1/0

As bcf_hrec_set_val() currently puts quote marks around the stored value if you set is_quoted, the best solution might be to do the same with the brackets. The easiest way to get that would be to include them in the string you pass in, and call it with is_quoted set to zero (you have to check the closing bracket is present for this to work correctly). I have had a go at altering your PR to do this, although my effort is a bit untidy at the moment. I could push it up here if you'd like.

Currently HTSlib won't do anything with these ##META values. Unfortunately the specification does not say much about the feature and how it should be used at the moment. However, it would be good to at least preserve them if present.

I should add that picard and so presumably HTSJDK are even worse at handling this, although I guess we shouldn't expect too much as it's a VCF4.3 feature, and I think they only really go up to VCF4.2. Currently they silently drop the leading bracket and everything but the last value, so the META lines end up like this:

##META=<ID=Assay,Type=String,Number=.,Values=Exome]>
##META=<ID=Disease,Type=String,Number=.,Values=Cancer]>
##META=<ID=Ethnicity,Type=String,Number=.,Values=MEX]>
##META=<ID=Tissue,Type=String,Number=.,Values=?]>

So I guess there may be a bit more work needed to get this header tag supported in the wider world.

@daviesrob
Copy link
Member

samtools/hts-specs#491 added a test file for this, so I guess we really ought to support it. I've taken the liberty of adding my changes to get the square brackets to round-trip, and a test based in the hts-specs file. I've also rebased it (so apologies for the forced-push) so that it picks up some unrelated changes needed so that our automated tests carry on working.

AlbertoCasasOrtiz and others added 2 commits June 9, 2021 15:03
Quoted lines are actually stored with the quote marks surrounding
them, so we can do the same with the square-bracket syntax used
in META lines to enable round-tripping.

Add VCF META header tag round-trip test

VCF test file comes from the hts-specs repository
file test/vcf/4.3/passed/passed_meta_meta.vcf modified to add
contig and FILTER headers so `htsfile -c` will round-trip
it without making changes or printing any warnings.
@whitwham whitwham merged commit 723e261 into samtools:develop Jun 9, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants