Added case of bracketed list of values to bcf_hdr_parse_line. #1240
You can probably reuse most of the code from
@@ -426,11 +426,23 @@ bcf_hrec_t *bcf_hdr_parse_line(const bcf_hdr_t *h, const char *line, int *len)
if (bcf_hrec_add_key(hrec, p, q-p-m) < 0) goto fail;
p = ++q;
while ( *q && *q==' ' ) { p++; q++; }
- int quoted = *p=='"' ? 1 : 0;
+ int quoted = 0;
+ char ending;
+ switch (*p) {
+ case '"':
+ quoted = 1;
+ ending = '"';
+ break;
+ case '[':
+ quoted = 1;
+ ending = ']';
+ break;
+ }
+
if ( quoted ) p++, q++;
while ( *q && *q != '\n' )
{
- if ( quoted ) { if ( *q=='"' && !is_escaped(p,q) ) break; }
+ if ( quoted ) { if ( *q==ending && !is_escaped(p,q) ) break; }
else
{
if ( *q=='<' ) nopen++;
@@ -444,7 +456,7 @@ bcf_hrec_t *bcf_hdr_parse_line(const bcf_hdr_t *h, const char *line, int *len)
while ( r > p && r[-1] == ' ' ) r--;
if (bcf_hrec_set_val(hrec, hrec->nkeys-1, p, r-p, quoted) < 0)
goto fail;
- if ( quoted && *q=='"' ) q++;
+ if ( quoted && *q==ending ) q++;
if ( *q=='>' ) { nopen--; q++; }
}
@valeriuo Thank you for your suggestions. You are right, it can be simplified. I have committed it with your changes and tested it, and it's working.
Sorry for taking a while to get back to this. It certainly fixes the parser, but there's a problem with round-tripping the data as the square brackets get converted into quotes, turning the list into a normal string. For example, with this small test file:
I get:
Currently HTSlib won't do anything with these. I should add that Picard, and so presumably HTSJDK, are even worse at handling this, although I guess we shouldn't expect too much as it's a VCF4.3 feature, and I think they only really go up to VCF4.2. Currently they silently drop the leading bracket and everything but the last value, so the META lines end up like this:
So I guess there may be a bit more work needed to get this header tag supported in the wider world.
samtools/hts-specs#491 added a test file for this, so I guess we really ought to support it. I've taken the liberty of adding my changes to get the square brackets to round-trip, and a test based on the hts-specs file. I've also rebased it (so apologies for the forced-push) so that it picks up some unrelated changes needed so that our automated tests carry on working.
Quoted lines are actually stored with the quote marks surrounding them, so we can do the same with the square-bracket syntax used in META lines to enable round-tripping.
Add VCF META header tag round-trip test
VCF test file comes from the hts-specs repository file test/vcf/4.3/passed/passed_meta_meta.vcf, modified to add contig and FILTER headers so `htsfile -c` will round-trip it without making changes or printing any warnings.
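To illustrate the round-tripping idea from the commit message above, here is a minimal standalone sketch. It is not the HTSlib implementation: `store_value` and `format_pair` are hypothetical helpers invented for this example, and the fixed-size buffers are an assumption made to keep it short. The point is only that if the stored value keeps its original delimiters (quotes or square brackets), writing it back out verbatim reproduces the input.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper (not an HTSlib function): copy a parsed value into
 * dst, re-wrapping it in the delimiters it was read with.  `open` is the
 * opening delimiter seen by the parser ('"', '[') or 0 for a bare value.
 * Returns 0 on success, -1 if dst is too small. */
static int store_value(char *dst, size_t size,
                       const char *val, size_t len, char open)
{
    char close = 0;
    if (open == '"') close = '"';
    else if (open == '[') close = ']';

    size_t need = len + (close ? 2 : 0) + 1;   /* +1 for the NUL */
    if (need > size) return -1;

    char *out = dst;
    if (close) *out++ = open;
    memcpy(out, val, len);
    out += len;
    if (close) *out++ = close;
    *out = '\0';
    return 0;
}

/* Hypothetical helper: write "key=value" using the stored value verbatim,
 * so whatever delimiters were kept at parse time come back unchanged. */
static int format_pair(char *dst, size_t size,
                       const char *key, const char *stored)
{
    int n = snprintf(dst, size, "%s=%s", key, stored);
    return (n < 0 || (size_t)n >= size) ? -1 : 0;
}
```

With this approach the writer never has to guess which delimiter to emit, which is what avoids the brackets being turned into quotes.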
I was using the following lines in a VCF file (extracted directly from the VCFv4.3.pdf file):
The full VCF file was parsed correctly, but those four lines were giving the following errors:
I found out that if I deleted the Values from each `##META` entry (e.g., deleting `Values=[WholeGenome, Exome]` in the first entry), it worked. Also, if I substituted the `[` and `]` symbols with `"`, it worked.
After looking at the file vcf.c and debugging the behaviour of the function bcf_hdr_parse_line (original), I found that the case of the value in a key-value pair starting and ending with brackets (`[` and `]`) was not handled, so it could not correctly parse the aforementioned lines.
So, I have added that case to the function bcf_hdr_parse_line (mine). After testing again with the same VCF file (and a couple more I have from the 1000 Genomes Project), it works perfectly now.
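For readers who want to see the delimiter handling in isolation, the following is a rough standalone sketch of the idea, not the code in vcf.c. The names `closing_delim`, `escaped` and `scan_value` are hypothetical, and the rule that an unquoted value ends at `,` or `>` is a simplifying assumption (the real parser also tracks nested `<`...`>`).

```c
#include <stddef.h>

/* Hypothetical helper: closing delimiter expected for a value that starts
 * with `open`, or 0 if the value is not delimited. */
static char closing_delim(char open)
{
    switch (open) {
    case '"': return '"';
    case '[': return ']';
    default:  return 0;
    }
}

/* True if the character at q is preceded by an odd number of backslashes. */
static int escaped(const char *start, const char *q)
{
    int n = 0;
    while (q > start && q[-1] == '\\') { n++; q--; }
    return n % 2;
}

/* Scan one value beginning at p.  On success *val and *len describe the
 * value without its delimiters, and the return value points just past the
 * value (and past its closing delimiter, if any).  Returns NULL when a
 * delimited value is not terminated before end of line. */
static const char *scan_value(const char *p, const char **val, size_t *len)
{
    char ending = closing_delim(*p);
    const char *q;

    if (ending) {
        q = ++p;                               /* skip opening delimiter */
        while (*q && *q != '\n' && !(*q == ending && !escaped(p, q)))
            q++;
        if (*q != ending) return NULL;         /* unterminated value */
        *val = p;
        *len = (size_t)(q - p);
        return q + 1;                          /* skip closing delimiter */
    }

    /* Assumption for this sketch: a bare value ends at ',' or '>'. */
    q = p;
    while (*q && *q != '\n' && *q != ',' && *q != '>') q++;
    *val = p;
    *len = (size_t)(q - p);
    return q;
}
```

Called on the text `[WholeGenome, Exome]>`, `scan_value` yields the 18-character value `WholeGenome, Exome` and returns a pointer to the trailing `>`, so the comma inside the brackets is no longer mistaken for a key-value separator.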