VCF Draft 4.5 and Modified Bases #767
It's not something I can comment on for VCF, as I don't know the landscape of tools so well there. However, for SAM et al. it was a fundamental requirement that it be a side-channel, as the BAM specification simply forbids additional letters in the sequence field. I agree it would have been easier from a specification point of view to update the sequence, but it's also harder when dealing with legacy software. That latter case can, however, be mitigated with filtering tools. (BCF doesn't use the nibble-encoding for sequence that BAM does, but we can't have single-letter codes for every type of modification, as there are too many of them once we get to RNA.) |
I'm with @cjw85. I think basing this on ALT would be simpler and more natural, as long as a sensible encoding of the ALT field can be found. I prefer ALT as it would allow the AF, AD, ADF/ADR, etc fields to naturally represent the values we care about (like AF for modification frequency, which is obviously the main one). |
I have more extensive thoughts on "mod as ALT" versus "mod as INFO field" (spoiler alert: I am in favor of the latter), which I will expand on later. One thought on the current proposal (since this is likely the thread for this as well) is that it would be good to add a specifier for which canonical base a modified base targets. Some modifications are independent of the canonical base (backbone sugar modified bases), and for others the targeted canonical base may depend on context. Therefore I would propose that the modified base INFO field be amended to include the intended canonical base target. For example instead of This will make things much easier on parsers in terms of programmatically identifying which canonical base is targeted (by the INFO tag and removing any ambiguity. In my opinion the alternative is a database of which canonical base corresponds to each potential modified base in ChEBI. I think I can safely assume that no one wants to maintain such a database. |
I've worked through my argument against the "mods as ALT". Let's start with the simple case for the two options. Given the simple case with a mod as info tag: This appears to be quite reasonable for both proposals. So now let's make this a tiny bit more complicated with 1 extra modified base 5hmC (single letter code
This would result in the following records (I may have some of the syntax slightly wrong; I don't work with GT fields all that often these days): mod as info tag: Unless I am missing something obvious I think that the second option here is pretty much a non-starter. This would require some pretty heavy logic just to get back to the simple answer that this location has a single canonical allele ( Some python code to produce this tag is here if anyone is interested in tinkering (or actually formatting the tags correctly):
|
In the ALT-based representation the number of haplotypes, and hence records in the ALT field/AF entries, is constrained by read depth and the combinations of modifications actually observed in the reads. So while it is possible to have a large number of entries and an unwieldy representation it is unlikely to happen in practice and up to the VCF-producing software to manage. This is no different than calling somatic mutations where complex patterns of subclonal mutations could be imagined but in practice isn't a major problem. |
As shown in the example, the representation is complicated not by the underlying state of the data, but instead by the "mod as ALT" specification. The modified base proportions shown are all above 10%, which requires very low coverage to determine even on a haplotype-specific mod pileup. The two haplotypes in reality share no information about the modified base annotation. The complexity of the many combinations of modified base state is completely a result of the "mod as ALT" specification. All of the state information is encapsulated in the "mod as INFO" specification and all of the proportions can be estimated without much trouble. I would argue that the proposed situation would not be all that uncommon. Beyond this first point, it is not simply up to the VCF writer here. The parsers must now handle the case where modified bases and variants occur together. This means that all current variant pipelines immediately become invalid until support for this is added. This will lead to a lack of adoption of the modified base feature for a considerable time period. If there were some massive benefit to the "mod as ALT" specification then it might be possible to drive adoption of this format by VCF parsing tools, but there is no discernible benefit for the "mod as ALT" specification other than the fact that it is possible (that I can see). Most users of VCF would want the file to be easy to use for variant analysis. If the modified base addition makes variant analysis much harder, I would worry that this would result in a lack of adoption of an aggregated modified base format. |
Thanks everyone for your valued feedback. For those of you who were not able to make the GA4GH Connect session where this VCFv4.5 release candidate 1 was presented, I'll make a brief summary here: Design goals:
I raised issues that we wanted more clarification on during the Q&A at the end of the session and I'm glad you've also raised some of them. Firstly, on ALT/INFO encoding, I'm unsure what you're referring to as
That's a reasonable position to take for SAM, as at a read level each base either has or does not have a given modification. If the modifications are mutually exclusive then it makes sense to treat them as additional base types. Unfortunately, this breaks down for VCF since it's not just about what bases are present but about the sample genotype.
Unfortunately this is not the case. Encoding in I think the different proposals come down to different goals. As is the case with base calling, it is not a design goal to losslessly encode the methylation co-phasing information present in raw reads. The question for methylation data is what is the essential information that we need to encode.
There's a precedent here. The VCF solution to handling subclonal heterogeneity is to encode each distinct subclone as a separate sample. That approach doesn't work if the entire evolutionary tree is still present at non-trivial AF. It may or may not work for methylation data. I'm not entirely sure what methylation data looks like. Does it (please select all options that apply): Depending on the prevalence and biological importance of these patterns, the current design may or may not be appropriate. It assumes that base modifications are a combination of (d) and (e) with support for (a), (b) and (c) via encoding each orthogonal pattern as a separate sample. That works for (a), somewhat for (b) (lots of columns), and not really for (c) (explosion of columns). All design is a trade-off. Supporting (c) adds a lot of additional complexity. How important is it to do so? Do we need to support methylation phasing when base call phasing is not available or do those always come together in assays? |
Agreed.
We could encode this in a custom tag in the |
The spec as it stands only has We'll probably end up with some ugly concatenation of names such as What fields would be useful? Anything other than AD/ADF/ADR? Any likelihood/phred-scaled fields such as an equivalent of PL? |
In terms of thinking of how much methylation phasing information is retained in the VCF, I'd like to propose the following thought experiment: Assume you have a perfect T2T single-cell sequencer that calls all bases with all modifications perfectly. You sequence 1,000,000 cells from a single sample. What information about the base modifications present in your sample should be included in the VCF? |
Each base modification was given its own independent |
In such a case we're sampling a population, so we have population statistics. We're basically recording prevalence data. However consider a case of an unsure base caller where it's 20% likely to call a variant due to error. It may be 100% of the time the bases aren't modified, but we have a 1 in 5 error rate making it look like a 20% population. For the data to be meaningful, we need to know if it is a prevalence figure or a likelihood of correctness figure. Obviously we could just have a hard quality threshold, but clearly this is demonstrating we need some level of quality information to disambiguate this so we know what the fractions actually mean.
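To make the ambiguity concrete, here is a minimal sketch. It assumes a simple per-call error model with a false-positive and a false-negative rate; that model and the function name `observed_mod_fraction` are illustrative assumptions, not anything the draft defines.

```python
def observed_mod_fraction(true_fraction, false_pos_rate, false_neg_rate=0.0):
    """Expected fraction of calls reported as modified under a simple error model.

    Assumption: each call is independently wrong with the given rates.
    """
    truly_modified_called = true_fraction * (1.0 - false_neg_rate)
    unmodified_miscalled = (1.0 - true_fraction) * false_pos_rate
    return truly_modified_called + unmodified_miscalled


# Case 1: nothing is modified, but the caller is wrong 1 time in 5.
error_only = observed_mod_fraction(true_fraction=0.0, false_pos_rate=0.2)

# Case 2: 20% of molecules are truly modified and the caller is perfect.
real_prevalence = observed_mod_fraction(true_fraction=0.2, false_pos_rate=0.0)

print(error_only, real_prevalence)
```

Both cases report the same fraction, so the number alone cannot say which situation produced it; some quality field alongside the fraction is needed to tell them apart.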
|
I think the above is a key point. We're retrofitting something on to an existing standard, but we have to be aware there is a lot of legacy software out there. This means if we change the meaning of an existing field then we must first get support from the main implementations to write tooling to strip out the new meaning and translate back to old VCF. For example, tools may just inspect GT and look for 0/0, 0/1 and 1/1 to count hom ref, hom alt and heterozygous stats. With base mods a naive counting like that is completely wrong, but probably silently so giving incorrect results rather than just failing. While I agree it'd be nice for everyone to have robust software that checks things in the correct manner, we all know this just isn't so! Also remember if we're going to support both the existing short published codes like "m" and "h", but also the more complex string ones (5hmC) and beyond to the huge plethora of ChEBI codes, then the parsing of ALT is going to become a messy task given we don't have bases as a list. They're just concatenated together. Is 5hmC ChEBI 5 followed by 'h', 'm' and 'C' bases, or is it just "5hmC" (aka 'h')? Is A,1234512346 ChEBI code 1234512346 or codes 12345 followed by 12346? We need an entirely new markup to be invented here with separators between bases. Maybe we can use |
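The concatenation problem can be sketched in a few lines. This is only an illustration: `possible_splits` is a hypothetical helper, and the numeric codes are the made-up ones from the comment above, not real ChEBI lookups.

```python
def possible_splits(digits, known_codes):
    """Return every way to split a digit string into a sequence of known codes."""
    if not digits:
        return [[]]
    splits = []
    for end in range(1, len(digits) + 1):
        head = digits[:end]
        if head in known_codes:
            # Recurse on the remainder and prepend the code we just matched.
            for tail in possible_splits(digits[end:], known_codes):
                splits.append([head] + tail)
    return splits


# Hypothetical code set, chosen so that one code is the concatenation of two others.
known = {"12345", "12346", "1234512346"}
interpretations = possible_splits("1234512346", known)
print(interpretations)
```

With no separator between bases, the parser gets two equally valid readings of the same string, which is why an explicit delimiter or markup would be needed.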
You're carving out an entire prefix letter for a specific purpose. The problem is one of parsing. How do we know this M field is a base modification field and not just someone's random custom tag they added that happens to start with M? This is something that SAM was so good on and bizarrely VCF so bad on (given it came second). SAM very clearly delineates user space (X*, Y*, Z* and lowercase) from official namespace (everything else). The ship has partly sailed already, but the official tags are never lowercase and never X, Y, Z* I think. So we could at least define that to be private space and never going to be reused, which would give tool chain authors somewhere to go. |
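For reference, the SAM-style split between user space and official namespace is cheap to check. The sketch below encodes the SAM rule for two-character tags; the function name is hypothetical, and applying the same rule to VCF keys is only the suggestion made in this comment, not something the VCF specification states.

```python
def is_sam_local_tag(tag):
    """True if a two-character SAM tag falls in the space reserved for local use.

    SAM reserves tags starting with X, Y or Z, and tags containing a
    lowercase letter in either position, for end users.
    """
    if len(tag) != 2:
        raise ValueError("SAM tags are exactly two characters long")
    return tag[0] in "XYZ" or any(ch.islower() for ch in tag)


for tag in ("MM", "NM", "XS", "zq"):
    print(tag, is_sam_local_tag(tag))
```

A parser can therefore trust that an unrecognised official-looking SAM tag is not someone's private annotation, which is exactly the guarantee a bare `M` prefix in VCF cannot give.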
I was mixing up INFO and FORMAT/SAMPLE fields. You are correct. By "mods as INFO" I am referring to the current proposal in the VCF 4.5 draft. There is no third proposal. Only the amendment I suggested to the current schema to specify the canonical base.
If we are already referring to the header for the definition of the field, why not just require the ChEBI code, the canonical base alternative and allow VCF writers to specify the short encoding for use throughout the file. I would personally prefer the tag in the records to be |
The difference in size is practically zero once gzipped since the FORMAT row definition (e.g. |
Yes SAM has been fixed for some time now. It's a different beast though as SAM tags are required to be 2 characters only and there's a limited name space too, so we can't reasonably invent a new tag for each and every type of base modification. Hence they're serialised into a single MM tag. VCF doesn't have that limitation so I don't think it's necessary to create such a format-within-a-format. (I didn't really want to do it in SAM either!) It's also easier in VCF to add new things into the header, which is nigh on impossible in SAM world. I do think we should be exploiting the VCF header more here, with meta-data to indicate a tag as a base modification. We can map tag names to ChEBI codes / widely recognised short codes (eg the ones listed in SAM). It gives the format better introspection and future proofing. I would be wary though of assuming all this stuff will vanish after compression. It sort of works, but remember gzip has a very limited buffer to look back over. Once lines get longer than 32kb, compression does start suffering considerably. Long tags just exacerbate that. |
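The window effect is easy to reproduce with zlib, which implements the same DEFLATE scheme gzip uses. This is a rough sketch on synthetic data: each "line" is random bytes (so nothing compresses within a line) repeated verbatim (so everything could compress between lines). The helper name and the sizes are arbitrary choices for illustration.

```python
import random
import zlib


def compressed_size(line_length, total_bytes=160_000, seed=0):
    """Compress a buffer made of one random line repeated back to back."""
    rng = random.Random(seed)
    line = bytes(rng.getrandbits(8) for _ in range(line_length))
    data = line * (total_bytes // line_length)
    return len(zlib.compress(data, 9))


# Previous copy of the line is 8,000 bytes back: inside the 32 KiB window.
short_lines = compressed_size(8_000)

# Previous copy is 40,000 bytes back: outside the window, so it cannot be reused.
long_lines = compressed_size(40_000)

print(short_lines, long_lines)
```

Both buffers hold the same number of bytes and are equally redundant, but only the one with short lines compresses well, because DEFLATE can only point back at data within its 32 KiB window.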
If we go down this path then we'd end up:
Pro:
Cons:
Cons could be addressed by standardising the dozen or so fields that current base modification assays actually report. Field names become rather ad-hoc and likely inherited from existing tooling (e.g. MC/MH/MA/Mh/Mm?). I prefer just reserving everything that could be used and aliasing but it's a weak preference and can be swayed. |
We could also allow general aliasing of ##FORMAT and ##INFO headers, but that seems like a large API overhead & headache for both VCF consumers and producers without much payoff other than making
SAM already does this for the PL field (although in retrospect that was probably a mistake). If they're important/widespread enough I don't see why they can't get their own alias. That said, having any aliases does lead one to want to generalise it to |
If some modified base codes are added to the specification, then I would prefer to see some idea of how to manage which new bases are "common enough" to warrant addition to the spec in the future. I think the recommended codes would be a much easier solution to manage and potentially reduce the burden on modified base readers and writers. I am mostly trying to avoid the situation with the SAM spec, which is now a combination of defined short codes and ChEBI codes for (what bacterial researchers would consider) common bases. See this issue for the crux of the issue: #741 |
The problem of going FORMAT only and no in-depth header meta-data is that defining format tags for the common modifications means over time you'll become the de facto maintainer of a database of common modification names! We do need someone to do that role, but it really needs someone skilled in the community (ideally RNA community as they have the most modification types). This is what killed #741 - no one is willing to act as the arbiter of what gets a short code and what doesn't, and even if I were happy to do it for SAM I'm simply way out of my depth to judge such things. However, even if we do go down the FORMAT M* means base modification rule, you still need header meta-data to back it up as a base modification tag, as there will be millions of files out there in the wild with private tags that already start with M, due to no controlled namespaces. So that means the "M*" becomes a convention and not a rule that parsers can trust. I don't see a way around that. |
We need to define a convention for unstranded methylation data. Any objections to standardising on the position of the first modified base in the motif on the positive strand and Similarly, a clarification that assays that don't include information for certain sites should use MISSING ( E.g. unstranded CpG puts the unstranded methylation data on the C and a |
We don't want VCF to have to maintain its own database of base modification short names. Is there a database we can reference? Candidates include: |
I've been thinking about this recently and want to raise a few issues that should be considered. The discussion so far has focused on situations where there is a variant w.r.t the reference, and how to encode modification frequencies for the REF and ALT allele. However, most of the modification calls will be at homozygous REF positions. I understand that the proposal is to emit records with empty ALT, which seems like a natural solution, and I want to point out that the VCF files will become substantially bigger as there are ~28M CpGs in the human reference. This leads to an issue about how positions that are not present in the VCF file are treated. Are they assumed to be uninformative (say, not covered by reads), or not modified? This is particularly important for all-context calling (5mC in any context, not just CpG) and it is not desirable to produce enormous VCFs. I can see some interaction with the gVCF format here, so specifying if/how the modification calls are treated in gVCF should also be considered. |
You're right, gVCF ref blocks are a natural fit here to disambiguate the not-covered vs the not-found scenarios. |
Changes to the 4.5 draft located at #770 |
We could extend the ref block to define the base modification fraction specified in a ref block to apply to all applicable bases in the ref block. Example:
Would mean that all the Cs in chr:1-1000 are unmethylated, and all the Cs in chr:2000-2499 are methylated. Would that cover your use case? |
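A minimal sketch of that ref-block reading follows. It assumes the block's reference sequence is available, that coordinates are 1-based, and that only forward-strand Cs are expanded; `expand_ref_block` is a hypothetical helper, not part of any VCF library or of the draft.

```python
def expand_ref_block(block_seq, block_start, mod_fraction, base="C"):
    """Apply one block-level modification fraction to every matching base.

    block_seq    reference sequence covered by the block
    block_start  1-based position of the first base in the block
    mod_fraction fraction recorded once for the whole block
    """
    return [
        (block_start + offset, mod_fraction)
        for offset, ref_base in enumerate(block_seq.upper())
        if ref_base == base
    ]


# Toy block starting at position 2000 in which every C is fully methylated.
per_site = expand_ref_block("ACGTCC", block_start=2000, mod_fraction=1.0)
print(per_site)
```

Positions outside any block stay absent from the output, which keeps "not covered" distinct from "covered and unmodified".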
Huh? I think this would be a terrible solution. Epigenetic modifications vary cell to cell. What are you going to put in the ALT field? "23% more methylated than age, sex, and ancestry controlled comparators"? You want THAT as a part of the ALT designation in a VCF file? How do you know the difference doesn't flow from differences in cell proportions between the sample at hand and the comparators? Or are you going to hard threshold it? If you have any mods at that position, then you get the ALT designation? It's not a workable thought. IMO the only sensible way forward is to incorporate that info as an annotation of the primary sequence, or even better as an annotation of a haplotype, alongside really well curated metadata ... |
VCF has to either change or die entirely. I don't think it is meaningful to discuss incorporation of epigenetic mod information without (at minimum) a gVCF file, or, (better) something like a GFA2. |
Following x.com/@fiamh I was made aware of the proposal for how to store modified base information in VCF files.
A quick search and I could not find any public discussions around this proposal.
The storing of modified base information in formats other than simple tables (i.e. BED files) is something we have often touted with Oxford Nanopore Technologies. Something better is certainly required than the bedMethyl standard/description.
I notice that the VCF 4.5 draft intends to list modified base information as part of the genotype. I think this may miss one of the underlying conceptual issues that has made more difficult the implementation of modified base information in SAM/BAM/CRAM: trying to treat modified bases as a special case.
I would argue that from the position of VCF as a format for enumerating possible alternative subsequences in samples, it is more natural to store simply in the ALT column the possibility of a modified base being present, as would be the case for any other non-reference base. This would lead, prima facie, to no special handling being necessary in other fields, though we could certainly create defined ancillary INFO fields if the need arises.
This may require some finagling of section 1.6.1.5 (description of ALT fixed field). For instance it could provide for non-{A,C,G,T,N} bases being enumerated in an <ID> string, or the tragically simple option of allowing other character codes.