-
Notifications
You must be signed in to change notification settings - Fork 241
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Colons in VCF #CHROM columns are not parsed and throw error #2179
Comments
While colons within chromosome names are legal within the VCF specification, this was a huge mistake which sadly we cannot undo. They create ambiguities in parsing. We've seen this before, where people create a contig name "chr1" and then another "chr1:1000-2000" to indicate a portion of that chromosome. Only now how do you specify the range of chr1 that happens to cover 1000-2000? Arguably it's a tooling issue, as the specifications don't define language like "chr1:1000-2000" to mean a sub-range, but the only reason people make chromosomes named like that is specifically because that's the encoding pattern used by tooling. If tools had used "=" instead of ":" then almost certainly they'd get used in chromosome names instead. There are some mitigations for this in htslib, where we added the ability to do {chr1}:1000-2000 as a method for explicitly stating the chromosome name is the "chr1" portion, but it's not honoured everywhere and it's specific to command line arguments. We obviously cannot change the specification of things like BED. In short, yes this is probably a bug, but please consider choosing chromosome names that aren't ambiguous in meaning to downstream tools. |
I do not see this as a mistake — once communities such as HLA had standardised on allele names containing colons and with users likely to use them as contigs, it became a requirement for formats such as VCF to enable this usage. I am unaware of any ambiguities in parsing VCF due to colons in chromosome names. It may create difficulties for tools, but if you have a VCF parsing ambiguity in mind please show an example. It would be a lot easier to investigate this if @groodri provided an actual example of the problem. I attempted to reproduce the problem using a couple of trivial VCF files:
Compressing and indexing them and running the command shown produces:
Thus the In the meantime this can be worked around by adding an appropriate
which I assume is the sort of output the OP was expecting. |
My point was one of command line parsing. Bcftools for example permits You're right that our hands were forced by HLA, but I still think it was a bad decision right from day one to use colon as a meaning for sub-regions and to not ban it in chromosome names. Life could have been so much simpler, but that's water under the bridge. |
To put into context, my use case is amplicon sequencing. Because we are only interested in specific amplicon sequences, it makes more sense computationally to use the amplicon sequences as reference for variant calling (e.g. Thus, I have VCF files that look like this:
Where called variants are positioned on the amplicon, which itself is named after the gene and region in the human genome. I understand there are multiple ways around this (and thank you for your suggestion @jmarshall -- that is indeed the output I was expecting). I was however not sure that this is the intended behavior for bcftools. Particularly, because of the fact that bcftools throws this error as a warning, so exit code is 0. Because exit code is 0, this makes it difficult to integrate this tooling into analysis pipelines with e.g. Nextflow, as we cannot directly test if a VCF file has appropriate contig names for bcftools. If allowing colons in contig names is not suitable, then would it not be better to throw an error, as opposed to a warning? |
Infact a proof of this being a poor choice is in the code that implements it: https://github.com/samtools/htslib/blob/develop/synced_bcf_reader.c#L1034-L1038
Basically it's stuffed. It accepts chr, chr:pos and chr:beg-end, but cannot disambiguate "chr:beg-end" being a poorly chosen chromosome name from a region of "chr". Basically it needs a code rewrite to include the header so it can then sanitize against the contig lines and figure out which bits of the string represents the chromosome and which may represent a region. I think our previous discussions on this were purely from a VCF specification viewpoint on whether or not it was technically parseable within the format, but we never discussed whether it was sensible. It can be solved in code, but I wonder how many other places the same bugs occur?
Edit: can possibly use |
Also I just can't get bcftools concat to work full stop with -a. I think it may not work. An example:
Despite identical headers and data for two separate chromosomes (the exact case in the usage statement it claims to be written for), I only get the 2nd chromosome output. This doesn't happen if I remove
@pd3 can you confirm this is a bug, or is it just confusing documentation / usage? (Ie user error) |
Ignore that last comment. It's some strangeness of pos 0 vs pos -1 when doing all-contig ranges. I don't understand the code well enough to know why it matters. Anyway, I have a PR incoming that will use |
Don't use _regions_init_string(), which misinterprets contig names containing colons as region specification strings. The code used _regions_init_string() rather than _regions_add() only when needed to allocate a new bcf_sr_regions_t structure; instead extract basic initialisation into a new bcf_sr_regions_alloc() function, which as a bonus checks the memory allocation. Use the new function throughout. Fixes samtools/bcftools#2179.
Like I said, the behaviour is clearly a bug. The particular bug is fixed by samtools/htslib#1781. James says he has fixed it a different way by improving the code to use the headers where available ETA :— As James's commit only affects the bad All this code exists because the synced reader code does not assume the presence of contig-defining headers so cannot use |
Don't use _regions_init_string(), which misinterprets contig names containing colons as region specification strings. The code used _regions_init_string() rather than _regions_add() only when needed to allocate a new bcf_sr_regions_t structure; instead extract basic initialisation into a new bcf_sr_regions_alloc() function, which as a bonus checks the memory allocation. Use the new function throughout. Fixes samtools/bcftools#2179.
Correct. My change modified It had two calls, one was in My code is in jkbonfield/htslib@34d6a6d for posterity, incase we ever find a use for this internal function elsewhere, but I don't think it's worth fixing it now. For reference, it looks like only vcfstats and vcfannotate (of our code) calls |
Note: Specific regions were substituted by
<gene>
,<chrom>
and<pos>
to hide sensitive information. Real usage did not contain brackets.After combining region of interest locations with
bedtools
to build a new reference FASTA:And then doing variant calling with this new reference, we obtain a chromosome (VCF column
#CHROM
) name structure like so:<gene>::<chrom>:<pos>
Upon using
bcftools concat
to merge VCF files:This structure throws an error:
However, as of VCFv4.3 and later this structure should be legal. See the official VCFv4.3 changelog:
This was tested on
bcftools=1.20
, installed from source code, and previouslybcftools=1.15.1
, installed on conda.I tried to confirm which
HTSlib
version was being compiled, but I couldn't figure it out. Does the current release ofbcftools
support VCFv4.3 and later?The text was updated successfully, but these errors were encountered: