Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

VCF Draft 4.5 and Modified Bases #767

Open
cjw85 opened this issue Apr 22, 2024 · 29 comments
Open

VCF Draft 4.5 and Modified Bases #767

cjw85 opened this issue Apr 22, 2024 · 29 comments
Assignees
Labels
Milestone

Comments

@cjw85
Copy link

cjw85 commented Apr 22, 2024

Following x.com/@fiamh I was made aware of the proposal for how to store modified base information in VCF files.

A quick search and I could not find any public discussions around this proposal.

The storing of modified base information in formats other that simple tables (i.e. BED files) is something we have often touted with Oxford Nanopore Technologies. Something better is certainly required than the bedMethyl standard/description.

I notice that the VCF 4.5 draft intends to list modified base information as part of the genotype. I think this may miss one of the underlying conceptual issues that has made more difficult the implementation of modified base information in SAM/BAM/CRAM: trying to treat modified bases as a special case.

I would argue that from the position of VCF as a format for enumerating possible alternative subsequences in samples, it is more natural to store simply in the ALT column the possibility of a modified-base being present; as would be the case for any other non-reference base. This would lead prima-facie, to no special handling being necessary in other fields. Though we could certainly create defined ancilliary INFO fields if the need arise.

This may require some finagling of section 1.6.1.5 (description of ALT fixed field). For instance it could provide for non-{A,C,G,T,N} bases being enumerated in an <ID> string, or the tragically simple option of allowing other character codes.

@jkbonfield
Copy link
Contributor

It's not something I can comment on for VCF as I don't know the landscape of tools so well there.

However for SAM et all it was a fundamental requirement that it be a side-channel as the BAM specifications simply forbids additional letters in the sequence field. I agree it would have been easier from a specification to update sequence, but it's also harder when dealing with legacy software. That latter case however can be mitigated with filtering tools however (like iconv does for text files), but it doesn't overcome hard limitations of file formats.

BCF doesn't use the nibble-encoding for sequence that BAM does, but we can't have single letter codes for every type of modification as there are too many of them once we get to RNA.

@jts
Copy link
Contributor

jts commented Apr 22, 2024

I'm with @cjw85. I think basing this on ALT would be simpler and more natural, as long as a sensible encoding of the ALT field can be found. I prefer ALT as it would allow the AF, AD, ADF/ADR, etc fields to naturally represent the values we care about (like AF for modification frequency, which is obviously the main one).

@marcus1487
Copy link

I have more extensive thoughts on the "mod as ALT" versus "mod as INFO field" (spoiler alert I am in favor of the latter), which I will expand on later, but one thought on the current proposal (since this is likely the thread for this as well), is that it would be good to add a specifier as to which canonical base a modified base is targeted. There are modifications which are canonical base independent (backbone sugar modified bases) or may be dependent upon context as to which canonical base a modified base targets. Therefore I would propose that that the modified base INFO field be ammended to include the intended canonical base target. For example instead of M5mC we would have MC5mC or using a ChEBI code example for 4mC instead of M21839 we would have MC21839. For a theoretical canonical base independent modified base 12345 the INFO tag could be MN12345. This could theoretically apply to any IUPAC ambiguous base group, but I think supporting ACGT+N would be a sufficient spec to catch all cases and reduce parser overhead for supporting the full IUPAC ambiguous alphabet.

This will make things much easier on parsers in terms of programmatically identifying which canonical base is targeted (by the INFO tag and removing any ambiguity. In my opinion the alternative is a database of which canonical base corresponds to which each potential modified base in ChEBI. I think I can safely assume that no one wants to maintain such a database.

@marcus1487
Copy link

marcus1487 commented Apr 22, 2024

I've worked through my argument against the "mods as ALT". Let's start with the simple case for the two options. Given the simple case with a C base annotated with 5mC (single letter code m) with 60% on haplotype 1 and 10% on haplotype 2. (edit to update simple tags in correct format and add GT hopefully in the right format)

mod as info tag:
chr 10 C . GT:M5mC 0|0:0.6,0.1
mod as ALT:
chr 10 C .,m GT,AF 1|0:0.6,0.9

This appears to be quite reasonable for both proposals. So now let's make this a tiny bit more complicated with 1 extra modified base 5hmC (single letter code h) and a variant inserting a second C base. Then let the proportions of modified bases be:

  • haplotype 1: REF allele (single C)
    • mods: 50% C, 10% m, 40% h
  • haplotype 2: insertion allele (second inserted C base)
    • mods base1: 30% C, 20% m, 50% h
    • mods base1: 10% C, 30% m, 60% h

This would result in the following records (I may have some of the syntax slightly wrong; I don't work with GT fields all that often these days):

mod as info tag:
chr 10 C .,CC GT:M5mC:5hmC 0|1:0.1,0.2,0.3:0.4,0.5,0.6
mod as ALT:
chr 10 C .,m,h,CC,Cm,Ch,mC,mm,mh,hC,hm,hh GT:AF 0|3,0|6,0|9,0|4,0|7,0|10,0|5,0|8,0|11,1|3,1|6,1|9,1|4,1|7,1|10,1|5,1|8,1|11,2|3,2|6,2|9,2|4,2|7,2|10,2|5,2|8,2|11:0.015,0.045,0.090,0.010,0.030,0.060,0.025,0.075,0.150,0.003,0.009,0.018,0.002,0.006,0.012,0.005,0.015,0.030,0.012,0.036,0.072,0.008,0.024,0.048,0.020,0.060,0.120

Unless I am missing something obvious I think that the second option here is pretty much a non-starter. This would require some pretty heavy logic just to get back to the simple answer that this location has a single canonical allele (0|1). If this got more complicated and there where more inserted C bases on which to annotate modified bases the field would increase in size exponentially.

Some python code to produce this tag is here if anyone is interested in tinkering (or actually formatting the tags correctly):

from itertools import product

num_mod_cats = 3
hp1_props = (0.5, 0.1, 0.4)
hp2_b1_props = (0.3, 0.2, 0.5)
hp2_b2_props = (0.1, 0.3, 0.6)
AF = ""
GT = ""
for (hp1, hp1_prop), (hp2_b1, hp2_b1_prop), (hp2_b2, hp2_b2_prop) in product(
    enumerate(hp1_props), enumerate(hp2_b1_props), enumerate(hp2_b2_props)
):
    AF += f",{hp1_prop*hp2_b1_prop*hp2_b2_prop:.3f}"
    GT += f",{hp1}|{num_mod_cats+hp2_b1+hp2_b2*num_mod_cats}"
print(f"{GT[1:]}\n{AF[1:]}")

@jts
Copy link
Contributor

jts commented Apr 22, 2024

In the ALT-based representation the number of haplotypes, and hence records in the ALT field/AF entries, is constrained by read depth and the combinations of modifications actually observed in the reads. So while it is possible to have a large number of entries and an unwieldy representation it is unlikely to happen in practice and up to the VCF-producing software to manage. This is no different than calling somatic mutations where complex patterns of subclonal mutations could be imagined but in practice isn't a major problem.

@marcus1487
Copy link

marcus1487 commented Apr 22, 2024

As shown in the example, the representation is complicated not by the underlying state of the data, but instead by the "mod as ALT" specification. The modified base proportions shown are all above 10% which requires very low coverage to determine even on a haploytpe specific mod pileup. The two haplotypes in reality share no information about the modified base annotation. The complexity of the many combinations of modified base state is completely a result of the "mod as ALT" specification. All of the state information is encapsulated in the "mod as INFO" specification and all of the proportions can be estimated without much trouble. I would argue that the proposed situation would not be all that uncommon.

Beyond this first point, it is not simply up to the VCF writer here. The parsers must now handle the case where modified bases and variants occur together. This means that all current variant pipelines immediately become invalid until support for this is added. This will lead to a lack of adoption of the modified base feature for a considerable time period. If there were some massive benefit to the "mod as ALT" specification then it may be possible to drive adoption by VCF parsing tools to use this format, but there is no discernible benefit for the "mod as ALT" specification other than the fact that it is possible (that I can see). Most users of VCF would want the file to be easy to do variant analysis. If the modified base addition makes variant analysis much harder, I would worry that this would result in a lack of adoption of an aggregated modified base format.

@d-cameron
Copy link
Contributor

Thanks everyone for your valued feedback. For those of you that were not able to make the GA4GH Connect session where the this VCFv4.5 release candidate 1 was presented at, I'll make a brief summary here:

Design goals:

  • Prevent proliferation of custom bed-like file formats
  • Minimal (ideally zero) loss of information when translating from everyone's' custom formats
  • As conceptually simple as possible
  • Can be added to an existing VCF
  • Extensible to arbitrary base modifications

I raised issues that we wanted more clarification on during the Q&A at the end of the session and I'm glad you've also raised some of them.

Firstly, on ALT/INFO encoding, I'm unsure what you're referring to as INFO encoding. The proposal using sample-specific FORMAT/genotype encoding and I'm not sure what would be put in INFO fields. I'm going to assume that when referring to INFO encoding you mean the FORMAT style encoding proposed in the draft and you're understandably not familiar with the internal nomenclature of the VCF specification. Please correct me if that's an incorrect assumption and there's actually a third design in play.

I would argue that from the position of VCF as a format for enumerating possible alternative subsequences in samples, it is more natural to store simply in the ALT column the possibility of a modified-base being present; as would be the case for any other non-reference base.

That's a reasonable position to take for SAM as at a read level as each base either has or does not have a given modification. If the modifications are mutually exclusive then it make sense to treat them as additional base type. Unfortunately, this breaks down for VCF since it's not just about what bases are present but about the sample genotype.

This would lead prima-facie, to no special handling being necessary in other fields.

Unfortunately this is not the case. Encoding in ALT breaks sample genotyping. In the @marcus1487 example ALT of .,m,h,CC,Cm,Ch,mC,mm,mh,hC,hm,hh the sample genotype is going to be diploid if the sample is indeed diploid. This means you can't actually reuse AD et al because you can't express the heterogeneity of the base modifications present because your sample is diploid sample. @marcus1487 had to redefine GT as list to even encode the relative in-record methylation phasing information. The notation for phasing across sites would be even messier.

I think the different proposals come down to different goals. As is the case with base calling, it is not a design goal to losslessly encode the methylation co-phasing information present in raw reads. The question for methylation data is what is the essential information that we need to encode.

This is no different than calling somatic mutations where complex patterns of subclonal mutations could be imagined but in practice isn't a major problem.

There's a precedent here. the VCF solution to handling subclonal heterogeneity is to encode each distinct subclone as a separate sample. That approach doesn't work if the entire evolutionary tree is still present at non-trivial AF. It may or may not work for methylation data. I'm not entirely sure what methylation data looks does it (please select all options that apply):
a) Form an admixture of distinct fixed genome-wide methylation pattern (e.g. multiple distinct cell types / cell states each with their own methylation patter)
b) Same as (a) but the patterns only apply to a subset of the genome (e.g. a on/off stress response pattern in a distinct subset of genes that turns on/off in the same way for all cell types independently of what's going on with genes outside that subset)
c) Phase locally (e.g. both Cs in a CpG are either methylated or not, independent of maternal/paternal status)
d) Phase with haplotype (e.g. all the maternal copies of gene X are methylated, none of the paternal)
e) Appear to be random
f) Something else (what have I missed?)

Depending on the prevalence and biological importance of these patterns, the current design may or may not be appropriate. It assumes that base modifications are a combination of (d) and (e) with support for (a), (b) and (c) via encoding each orthogonal pattern as a seperate sample. That works for (a), somewhat for (b) (lots of columns), and not really for (c) (explosion of columns). All design is a trade-off. Supporting (c) adds a lot of additional complexity. How important is it to do so? Do we need to support methylation phasing when base call phasing is not available or do those always come together in assays?

@d-cameron
Copy link
Contributor

Therefore I would propose that that the modified base INFO field be ammended to include the intended canonical base target. ... the alternative is a database of which canonical base corresponds to which each potential modified base in ChEBI. I think I can safely assume that no one wants to maintain such a database.

Agreed.

For example instead of M5mC we would have MC5mC or using a ChEBI code example for 4mC instead of M21839 we would have MC21839.

We could encode this in a custom tag in the ##FORMAT header itself or in the field name. I'm happy with field name but as a suffix instead as this means the aliased tags such as M5mC just stay as they are.

@d-cameron
Copy link
Contributor

d-cameron commented Apr 23, 2024

I prefer ALT as it would allow the AF, AD, ADF/ADR, etc fields to naturally represent the values we care about (like AF for modification frequency, which is obviously the main one).

The spec as it stands only has AF but I agree some sort of methylation equivalent of AD/ADF/ADR is needed.

We'll probably end up with some ugly concatenation of names such as M5mCAD/M5mCADF/M5mCADR to encode the different data types.

What fields would be useful? Anything other than AD/ADF/ADR? Any likelihood/phred-scales fields such as an equivalent of PL?

@d-cameron
Copy link
Contributor

In terms of thinking of how much methylation phasing information is retained in the VCF, I'd like to propose the following thought experiment: Assume you have a perfect T2T single-cell sequencer that calls all bases with all modifications perfectly. You sequence 1,000,000 cells from a single sample. What information about the base modifications present in your sample should be included the VCF?

@d-cameron
Copy link
Contributor

we can't have single letter codes for every type of modification as there are too many of them once we get to RNA.

There are modifications which are canonical base independent (backbone sugar modified bases) or may be dependent upon context as to which canonical base a modified base targets.

Each base modification was given it's own independent FORMAT field as some modifications are genuinely independent of each other. Common ones are not (e.g. a base can't be both 5mC & 5hmC) but others can be. Are there any edge cases (that we could potentially care about) of modification combinations that an independent ChEBI code for each modification wouldn't adequately describe? Not worried about nonsensical value(e.g. M15377C), or data inconsistences (e..g GT:M5mC:M5hmC 0/0:1:1), just potential ambiguities.

@d-cameron d-cameron added the vcf label Apr 23, 2024
@d-cameron d-cameron added this to the VCFv4.5 RC1 milestone Apr 23, 2024
@d-cameron d-cameron self-assigned this Apr 23, 2024
@jkbonfield
Copy link
Contributor

In terms of thinking of how much methylation phasing information is retained in the VCF, I'd like to propose the following thought experiment: Assume you have a perfect T2T single-cell sequencer that calls all bases with all modifications perfectly. You sequence 1,000,000 cells from a single sample. What information about the base modifications present in your sample should be included the VCF?

  1. This raises a question I was struggling to expand on during the GA4GH session.

In such a case we're sampling a population, so we have population statistics. We're basically recording prevalence data. However consider a case of an unsure base caller where it's 20% likely to call a variant due to error. It may be 100% of the time the bases aren't modified, but we have a 1 in 5 error rate making it look like a 20% population. For the data to be meaningful, we need to know if it is a prevalence figure or a likelihood of correctness figure.

Obviously we could just have a hard quality threshold, but clearly this is demonstrating we need some level of quality information to disambiguate this so we know what the fractions actually mean.

  1. You also mentioned if we want absolute counts, fractions or phred scores. Firstly Phred is a disaster here. It's good at very high rates like 0.999 vs 0.99, but it's hopeless with the low end and has no way to distinguish 0.001 from 0.01. For that you want log-odds instead. However neither are precise obviously and granularity gets fuzzy around 50%. Secondly, counts don't work unless we have both with and without counts so we can work out the fraction. Maybe we require AD, but that's also nuanced and problematic. Are they both filtered the same, etc? Ultimately we need to decide if we want linear precision or exponents. Fractions may well be easiest.

@jkbonfield
Copy link
Contributor

Beyond this first point, it is not simply up to the VCF writer here. The parsers must now handle the case where modified bases and variants occur together. This means that all current variant pipelines immediately become invalid until support for this is added. This will lead to a lack of adoption of the modified base feature for a considerable time period. If there were some massive benefit to the "mod as ALT" specification then it may be possible to drive adoption by VCF parsing tools to use this format, but there is no discernible benefit for the "mod as ALT" specification other than the fact that it is possible (that I can see). Most users of VCF would want the file to be easy to do variant analysis. If the modified base addition makes variant analysis much harder, I would worry that this would result in a lack of adoption of an aggregated modified base format.

I think the above is a key point. We're retrofitting something on to an existing standard, but we have to be aware there is a lot of legacy software out there. This means if we change the meaning of an existing field then we must first get support from the main implementations to write tooling to strip out the new meaning and translate back to old VCF.

For example, tools may just inspect GT and look for 0/0, 0/1 and 1/1 to count hom ref, hom alt and heterozygous stats. With base mods a naive counting like that is completely wrong, but probably silently so giving incorrect results rather than just failing. While I agree it'd be nice for everyone to have robust software that checks things in the correct manner, we all know this just isn't so!

Also remember if we're going to support both the existing short published codes like "m" and "h", but also the more complex string ones (5hmC) and beyond to the huge plethora of ChEBI codes, then the parsing of ALT is going to become a messy task given we don't have bases as a list. They're just concatenated together. Is 5hmC ChEBI 5 followed by 'h', 'm' and 'C' bases, or is it just "5hmC" (aka 'h')? Is A,1234512346 ChEBI code 1234512346 or codes 12345 followed by 12346? We need an entirely new markup to be invented here with separators between bases. Maybe we can use , but it's a marked change from what we currently have and probably best just avoided.

@jkbonfield
Copy link
Contributor

You're carving out an entire prefix latter for a specific purpose. Msomething to mean a base modification. Maybe this needs to be a header tag for the relevant format field. "ISMOD" or some such.

The problem is one of parsing. How do we know this M field is a base modification field and not just someones random custom tag they added that happens to start with M?

This is something that SAM was so good on and bizarrely VCF so bad on (given it came second). SAM very clearly delineates user space (X*, Y*, Z* and lowercase) from official namespace (everything else). The ship has partly sailed already, but the official tags are never lowercase and never X, Y, Z* I think. So we could at least define that to be private space and never going to be reused, which would give tool chain authors somewhere to go.

@marcus1487
Copy link

I'm unsure what you're referring to as INFO encoding. The proposal using sample-specific FORMAT/genotype encoding and I'm not sure what would be put in INFO fields.

I was mixing up INFO and FORMAT/SAMPLE fields. You are correct. By "mods as INFO" I am referring to the current proposal in the VCF 4.5 draft. There is no third proposal. Only the amendment I suggested to the current schema to specify the canonical base.

Maybe this needs to be a header tag for the relevant format field. "ISMOD" or some such.

If we are already referring to the header for the definition of the field, why not just require the ChEBI code, the canonical base alternative and allow VCF writers to specify the short encoding for use throughout the file. I would personally prefer the tag in the records to be Mm and Mh for example to save some space since all the metadata about the modification is in the header anyways. It might be common to use the suggested 5mC, 5hmC, etc, but I think it would be cleaner to remove these from the spec and allow users to specify the meaning of the M* tags by some defined header line since that is on the table here. Adding some modifications to the specification and not others creates the illusion of a database that users will invariably ask to add on to. If there are no special modified bases and the ChEBI codes are specified in the header the specification becomes a bit more future proof.

@d-cameron
Copy link
Contributor

I would personally prefer the tag in the records to be Mm and Mh for example to save some space since all the metadata about the modification is in the header anyways.

The difference in size is practically zero once gzipped since the FORMAT row definition (e.g. GT:AD:ADR:ADF:M5hmC) is almost always identical for every row so it compresses efficiently. My preference is for readability and consistency over terseness but one could make a consistency arguement about matching SAM fields. What's the state of SAM methylation encoding? Is that finalised?

@jkbonfield
Copy link
Contributor

Yes SAM has been fixed for some time now.

It's a different beast though as SAM tags are required to be 2 characters only and there's a limited name space too, so we can't reasonably invent a new tag for each and every type of base modification. Hence they're serialised into a single MM tag. VCF doesn't have that limitation so I don't think it's necessary to create such a format-within-a-format. (I didn't really want to do it in SAM either!) It's also easier in VCF to add new things into the header, which is nigh on impossible in SAM world.

I do think we should be exploiting the VCF header more here, with meta-data to indicate a tag as a base modification. We can map tag names to ChEBI codes / widely recognised short codes (eg the ones listed in SAM). It gives the format better introspection and future proofing.

I would be wary though of assuming all this stuff will vanish after compression. It sort of works, but remember gzip has a very limited buffer to look back over. Once lines get longer than 32kb, compression does start suffering considerably. Long tags just exacerbate that.

@d-cameron
Copy link
Contributor

d-cameron commented May 3, 2024

It might be common to use the suggested 5mC, 5hmC, etc, but I think it would be cleaner to remove these from the spec and allow users to specify the meaning of the M* tags by some defined header line since that is on the table here.

I do think we should be exploiting the VCF header more here, with meta-data to indicate a tag as a base modification. We can map tag names to ChEBI codes / widely recognised short codes (eg the ones listed in SAM). It gives the format better introspection and future proofing.

If we go down this path then we'd end up:

  • Adding BaseModification=[ACGTN] to any ##FORMAT header
  • Adding ChEBI="99999" to any ##FORMAT header
  • Require them both to appear together
  • Defining Number=M,

Pro:

  • Short field names
  • Proper array length validation possible (instead of Number=. and a huge explainatory text section in the specs)
  • Can store additional arbitrary data.
  • Don't need to reserve an entire regex of reserved fields

Cons:

  • No spec-defined fields. Each caller could/would use it's own fields
    -- Worse interoperability

Cons could be addressed by standardising the dozen or so fields that current base modification assays actually report. Field names become rather ad-hoc and likely inherited from existing tooling (e.g. MC/MH/MA/Mh/Mm?).

I prefer just reserving everything that could be used and aliasing but it's a weak preference and can be swayed.

@d-cameron
Copy link
Contributor

We could also allow general aliasing of ##FORMAT and ##INFO headers that seems like a large API overhead & headache for both VCF consumers and producers without much payoff other than making M5mC et al a little less special.

Adding some modifications to the specification and not others creates the illusion of a database that users will invariably ask to add on to

SAM already does this for the PL field (although in retrospect that was probably a mistake).

If they're important/widespread enough I don't see why they can't get their own alias. That said, having any aliases is does lead one to want to generalise it to AliasOf=. Or just make it consistent so there's no M4mC at all, just M21839C and a subset of VCF users getting very familiar with a dozen or so ChEBI codes. Cleaner in every way except actual usability.

@marcus1487
Copy link

If some modified base codes are added to the specification, then I would prefer to see some idea of how to manage which new bases are "common enough" to warrant addition to the spec in the future. I think the recommended codes would be a much easier solution to manage and potentially reduce the burden on modified base readers and writers. I am mostly trying to avoid the situation with the SAM spec which is now a combination of defined short codes and ChEBI codes for (what bacterial reserachers would consider) common bases. See this issue for the crux of the issue: #741

@jkbonfield
Copy link
Contributor

The problem of going FORMAT only and no in-depth header meta-data is that defining format tags for the common modifications means over time you'll become the defacto maintainer of a database of common modification names! We do need someone to do that role, but it really needs someone skilled in the community (ideally RNA community as they have the most modification types). This is what killed #741 - no one is willing to act as the arbiter of what gets a short code and what doesn't, and even if I were happy to do it for SAM I'm simply way out of my depth to judge such things.

However, even if we do go down the FORMAT M* means base modification rule, you still need header meta-data to back it up as a base modification tag as there will be millions of files out there in the wild with private tags that already start with M, due to no controlled namespaces. So that means the "M*" becomes a convention and not a rule that parsers can trust. I don't see a way around that.

@d-cameron
Copy link
Contributor

We need to define a convention for unstranded methylation data. Any objections to standardising on the position of the first modified base in the motif on the positive strand and . for the other base?

Similarly, a clarification that assays that don't include information for certain sites should use MISSING (.) for those sites or omit it completely.

E.g. unstranded CpG puts the unstranded methylation data on the C and a . on the G.

@d-cameron
Copy link
Contributor

We don't want VCF have to maintain it's own database of base modification short names. Is there database we can reference? Candidates include:

@jts
Copy link
Contributor

jts commented May 8, 2024

I've been thinking about this recently and want to raise a few issues that should be considered.

The discussion so far has focused on situations where there is a variant w.r.t the reference, and how to encode modification frequencies for the REF and ALT allele. However, most of the modification calls will be at homozygous REF positions. I understand that the proposal is to emit records with empty ALT, which seems like a natural solution, and I want to point out that the VCF files will become substantially bigger as there are ~28M CpGs in the human reference.

This leads to an issue about how positions that are not present in the VCF file are treated. Are they assumed to be uninformative (say, not covered by reads), or not modified? This is particularly important for all-context calling (5mC in any context, not just CpG) and it is not desirable to produce enormous VCFs. I can see some interaction with the gVCF format here, so specifying if/how the modification calls are treated in gVCF should also be considered.

@jkbonfield
Copy link
Contributor

You're right, gVCF ref blocks are a natural fit here to disambiguate the not-covered vs the not-found scenarios.

@d-cameron
Copy link
Contributor

Changes to the 4.5 draft located at #770

@d-cameron
Copy link
Contributor

This leads to an issue about how positions that are not present in the VCF file are treated. Are they assumed to be uninformative (say, not covered by reads), or not modified? This is particularly important for all-context calling (5mC in any context, not just CpG) and it is not desirable to produce enormous VCFs. I can see some interaction with the gVCF format here, so specifying if/how the modification calls are treated in gVCF should also be considered.

We could extend the ref block to define the base modification fraction specified in a ref block to apply to all applicable bases in the ref block.

Example:

#CHROM POS REF ALT FORMAT SAMPLE
chr    1 C . GT:LEN:M5mC 0/0:1000:0
chr 2000 G . GT:LEN:M5mC 0/0:500:1

Would mean that all the Cs in chr:1-1000 are unmethylated, and all the Cs in chr:2000-2499 are methylated.

Would that cover your use case?

@LauferVA
Copy link

LauferVA commented Jul 3, 2024

I'm with @cjw85. I think basing this on ALT would be simpler and more natural, as long as a sensible encoding of the ALT field can be found. I prefer ALT as it would allow the AF, AD, ADF/ADR, etc fields to naturally represent the values we care about (like AF for modification frequency, which is obviously the main one).

Huh? I think this would be a terrible solution. Epigenetic modifications vary cell to cell. What are you going to put in the Alt field? "23% more methylated than age, sex, and ancestry controlled comparators"? You want THAT as a part of the ALT designation in a VCF file? How do you know the difference doesnt flow from differences in cell proportions between the sample at hand in the comparators? Or are you going to hard threshold it? If you have any mods at that position, then you get the Alt designation? It's not a workable thought.

IMO the only sensible way forward is to incorporate that info as an annotation of the primary sequence, or even better as an annotation of a haplotype, along side really well curated metadata ...

@LauferVA
Copy link

LauferVA commented Jul 3, 2024

I've been thinking about this recently and want to raise a few issues that should be considered.

The discussion so far has focused on situations where there is a variant w.r.t the reference, and how to encode modification frequencies for the REF and ALT allele. However, most of the modification calls will be at homozygous REF positions. I understand that the proposal is to emit records with empty ALT, which seems like a natural solution, and I want to point out that the VCF files will become substantially bigger as there are ~28M CpGs in the human reference.

This leads to an issue about how positions that are not present in the VCF file are treated. Are they assumed to be uninformative (say, not covered by reads), or not modified? This is particularly important for all-context calling (5mC in any context, not just CpG) and it is not desirable to produce enormous VCFs. I can see some interaction with the gVCF format here, so specifying if/how the modification calls are treated in gVCF should also be considered.

VCF has to either change or die entirely. I don't think it is meaningful to discuss incorporation of epigenetic mod information without (at minimum) a gVCF file, or, (better) something like a GFA2.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

6 participants