-
Notifications
You must be signed in to change notification settings - Fork 174
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Storing base modifications (methylation) #362
Comments
I'll leave this here with no comment... ;-) |
Revisiting this I now understand it better. Strand is important as it's the way we indicate on a double-stranded template that we have observed the methylation on the opposite strand. Some sequencing methods do use both strands (eg PacBio?) so this is a genuine issue. My favoured simple proposal for now is to use a style which is a halfway house between MD and SA tags. I rejected the BQ style notation (one value per base, with most being NOP) because it's just inefficient and bloated in SAM, BAM and memory and the frequency of modifications is low. We store methylation codes and distances between them (MD style), but our codes are numeric IDs to permit a wider range as there are already over 100 known for RNA and growing. We also need to cope with the potential for a single site having a modification on both top and bottom strand simultaneously (vanishingly rare, but we cannot rule it out). Thus we combine pos,type(,type*) lists, separated by a semi-colon (SA style). An example (tag name is arbitrary!):
(Where "1" is 5mC, "2" is 5mC on complementary strand, and "3" is 5hmC. These are "m", "1" and "h" in Coby's slides respectively.) It's not optimal for compression as we're mixing multiple types of data into the single field - types, strand (part of type) and position-deltas and also not taking into account data duplication (knowledge of primary base type). However it's simple, in keeping with the rest of SAM, and frankly it's unlikely to be a HUGE hit whatever we do. If we wish to separate out strand, we could do so explicitly. Eg above converted to:
Any thoughts? |
If you add this to the specification you'll have to update it every time someone looks at a new modification (what about 5-fC and 5-caC?)... |
We'd populate the standard set of modification types with a long list of known ones, so we wouldn't envisage a change happening regularly. Coby Viner has already given a good talk on this and there are papers explaining it too. See the above links. The second bit which I forgot to add is that the ChEBI registery of small molecular chemicals can also be used to give a more complex description of anything we haven't thought of. It was also advocated by the paper. This is probably best done by simply hacking the name space. Eg either positive types are ours (see the defined table) and negative types are theirs, or better still letters are our shorthand (m, h, f, etc) and numbers are ChEBI.. We don't want to define everything using ChEBI because even the common modifications are 5 digit numbers which just bloats everything. So maybe this:
I'm still unsure about whether strand should be explicit as it is here, or encoded within the type as Coby suggests. Both work. Explicit strand is cleaner and so I'm leaning towards that, but not all technologies (most?) are going to be reporting double-stranded methylation. Edit: or shorter form Edit: shorter still, only use semicolon for ChEBI IDs: |
Do you also plan a code for "There's a modification, but I don't know what it is"? I have a method that detects statistically significant variance from the unmodified nucleotide kmer models in Oxford Nanopore data, and it would be great to be able to note this in the alignment. I'd originally thought of hacking the MD string with an 'm' in those positions, which would easily allow me to make a standards-conforming file by conversion of the MD string to upper case, but real support for encoding an unknown mod would be better :-) |
It was in the paper linked above. I'm not sure which are practically likely to be useful but some form of modification ambiguity code seems sensible. |
I like: MM:Z:105+m;27-m;150+m-h;412+f;50+76397; But modifications are already part of the input sequence of mappers. Basecallers start to extend the alphabet. Will they encode an MM:Z tag in the FASTQ @ line? What are the ideas/suggestions there? |
That's a good question. FASTQ is already operating outside it's own weak specification, eg via the aux tag elements, but it seems to be a common method and I don't see why MM tags couldn't be encoded that way. Alternatively, for the common modification types FASTQ doesn't have the alphabet limitations that BAM does so it may be justified to store the data in the sequence portion instead. It'd need an aligner that understands it though. If it doesn't then an MM (or whatever) style tag permits us to use simple alignment methods while having a sidechannel for the methylation data. Ugly though! |
Btw. methylated Cs make a high proportion of the bases, e.g. in a human sample that we have it's about 25% of the Cs. |
ugh. I object to having some modifications be in the bases and others
in a tag...that seems very error-prone.
…On Wed, Feb 27, 2019 at 9:17 AM Andreas Hauser ***@***.***> wrote:
Btw. methylated Cs make a high proportion of the bases, e.g. in a human sample that we have it's about 25% of the Cs.
So a tag might be expensive.
Maybe some modifications deserve expanding the bases alphabet.
—
You are receiving this because you are subscribed to this thread.
Reply to this email directly, view it on GitHub, or mute the thread.
|
Information theory. Things that are common are represented compactly. |
There aren't any bits available in the BAM format for expanding the alphabet of bases, so for BAM the choices are encoding this in a tag or making an incompatible BAMv2 format. IMHO it'd be useful (at least for viewing!) if methylated-C (and a handful of others) could also be represented directly in the SEQ field in SAM, where there is scope for expanding the alphabet. Note though that IUPAC ambiguity characters are already used in that field, so there's more letters already taken than you might guess. |
For viewing (only) you presumably mean either IGV (or similar) or SAM
(as in not BAM or CRAM). For both of these we could use an expanded
alphabet as we are not limited to the encoding in BAM or CRAM. For
example, we could have a SAM-only representation that has well-defined
transformation into the "traditional" SEQ+tag represenation.
…On Wed, Feb 27, 2019 at 10:00 AM John Marshall ***@***.***> wrote:
There aren't any bits available in the BAM format for expanding the alphabet of bases, so for BAM the choices are encoding this in a tag or making an incompatible BAMv2 format.
IMHO it'd be useful (at least for viewing!) if methylated-C (and a handful of others) could also be represented directly in the SEQ field in SAM, where there is scope for expanding the alphabet. Note though that IUPAC ambiguity characters are already used in that field, so there's more letters already taken than you might guess.
—
You are receiving this because you commented.
Reply to this email directly, view it on GitHub, or mute the thread.
|
Yeah, by “viewing” I meant human beings looking at textual SAM files, and that transformation that you just described rather well is exactly what I had in mind. Seems like it would be useful for tools to be able to both read and write such SAM files (perhaps not by default!). |
For clarification, I'll use the term "fundamental seq" to mean the unmethylated original sequence. I'm all in favour of a transform request that tags "fundamental seq" plus methylation tag to produce methylated seq, but the limitations of BAM are what dictate splitting this into seq+tag. I also think it is better for most analysis tools if the sequence they read is the "fundamental seq". If we start storing other base types there then most software will interpret this as N, which is a huge problem. Having the option to transform sequence only for specific tools gives us the most flexibility. As for size, tag doesn't have to be large. Some forms are larger in memory than others, but we can do data compression efficiently on the tags still. Potentially more so than putting this data into the sequence field, which will cause havoc for reference based compressors such as CRAM. |
There are 15 letters in IUB. As to the tag, another option is to use a structured array like: XX:B:i,27551,2,10,20,-27552,1,30 "+27551" indicates methylation on the forward "strand". There are "2" modifications at offset 10 and 20, respectively. There is "1" modification of 27552 on the reverse "strand" at 30. I quoted "strand" because there are two types of strands: read strand and modification strand, which may not be the same (e.g. in case of Illumina bisulfite sequencing). I am not sure what is the best to use. This is an intricate issue. Such a B-typed array is easier for programs to parse, though is not good for eyeballing. I am also good with @jkbonfield's proposal. |
I need to get back to my evaluation of what encodings work well and what don't. [I've been rather side-tracked recently by the CRAMv4 prototyping (in Javascript Heng!), but that's for a different issue.] From a compression point all can be morphed into each other of course, but I'm talking about existing methods in both BAM and CRAM. Several worked OK though (and several not so great). Anything in ASCII is doomed to failure due to alphabet sizes. It's growing all the time and while currently we're dominated by just a few symbols, there are hundreds of modifications known about and over time we'll start to see instruments that can distinguish them, or distinguish to one of several. That means ambiguity codes, which further expands our alphabet. I think just kill the idea of embedding in the sequence and use a side channel from day one so we don't have to retrofit it in at a later stage. |
More precisely it's not "common", but "accurately predicted". Eg "e" is more common than "u" in English text, but we can predict the presence of "u" following a "q" with very high accuracy so those typically take up less space. PPM for the win. :-) It's also true whether they are embedded within the SEQ field or whether they are common elements of a string or B array, although some standard compression tools will do better on one than the other despite being equivalent from an entropy perspective. We also need to consider memory footprint. It's quite efficient to compress a long string of characters with low information content, eg the BQ array which is mostly "@" (@=64, so no +/- change); probably more so than attempting a CIGAR / MD style encoding. However the BAM record in memory is still large as it'll almost certainly be the uncompressed form. |
If something compresses well, it just meins that it is encoded And while you want efficiency you also want readability and room for all the modifications. How about UTF-8? |
Like http://modomics.genesilico.pl/modifications/ ? Please no! The fact is we need the sequence to be unadulterated so that BAM works (it cannot deal with more than is currently used) and so that many other existing tools don't interpret unknown symbols as Ns. Methylation support is a new feature, so it should be a new field without changing the meaning of the existing one. |
I haven't got time to give the full summary here, but some conclusions from experiments so far.
|
This is really great discussion on storage in the SAM/BAM/CRAM format, but I wanted to raise a point for analysis downstream from this storage point. When aggregating methylated/modified bases per-reference site there does not seem to be a standard format. The ENCODE bedMethyl format is the closest thing I have found, but this format is not reproducibly extensible to multiple modifications (e.g. 5mC, 5hmC, 5fmC) called on the same set of reads. Adding custom modified base format tags to the VCF spec (modVCF?) seems like a pretty decent option to me (though the underlying ploidy assumptions would be broken a bit). This may not be the right forum for this point, so please let me know if there is a more appropriate location to raise this point (or if this should be moved to a separate issue). P.S. Background: I am working on methods at ONT to skip FASTQ and SAM/BAM/CRAM to give higher accuracy modified base calls anchored to the reference. There is a lot of information (e.g. confidence metrics for a set of modified bases per-reference site, per-read) to be carried through from signal processing to per-reference site read aggregation that would seem to be onerous on the SAM/BAM/CRAM specs (as discussed in this thread). As such I am looking for the correct per-reference site multiple-modified base format as a direct output. |
@marcus1487 bedMethyl is easier to parse, but it is less extensible. There is also no standard way to collate data from different samples. VCF may already do the job with symbolic alleles. |
@lh3 I like that bedMethyl is easier to parse as well. And I am less worried about collating data from different samples (though it is a nice bonus). I am currently more concerned with including multiple modification types in the same file from a single sample. For example calling 6mA, 5mC and 5hmC and collating results across reads from a single sample. It seems like some version of VCF might be the best solution. |
Apologies for the slow progress of this. I'm also juggling several other things and haven't managed to put the time on this it deserves. However it is also vital to get people closer to methylation work involved so I value all these inputs greatly. Thank you. I don't see it as onerous to have to store methylation data in SAM/BAM/CRAM. Rather I see it as a way, if done correctly, of bringing uniformity and consistency with storing all data in the same place instead of suffering a proliferation of extra side-channels to hold the missing information. We've already seen several different letters being used to describe the same 5mC in fastq. |
@jkbonfield to expand on my point on the onerous storage in [S|B|CR]AM, I am thinking more about storing probabilities for a number of modifications in the same read at the same reference/read position. This could improve downstream results when aggregating over reads. For example it is conceivable to output probabilities for canonical C, 5mC, 5hmC, 5fmC and more for each read at each position. Then when aggregating over reads these probabilities could be used to produce better estimate for the total fraction of each modification at a reference position than would hard thresholding values to pick just the most likely of those modifications within each read. This is why I think storage of methylation in the [S|B|CR]AM format might be onerous. On the FASTQ storage issue, that is at least partially my fault (I arbitrarily added the Z base output for 5mC to the ONT mod base caller). I'd love to standardize that as well (especially since more mods are coming). Is there a consensus on how/with what single letter alphabet methylated/modified bases should be stored in a FASTQ? |
Are there any thoughts/pointers for
|
Has there been any development on the issues raised here? We are looking to release a tool which would output modified base calls anchored to a reference for many modifications simultaneously. As far as I can tell there is no format that supports this right now. As noted above, we are currently planning to use VCF format with some (hopefully quite logical) tweaks in order to support storage of multiple modified bases (include strand and mod specific format fields). If there are better alternatives that would be accepted by the wider field it would be great to get these integrated soon (if likely after release at this point). The other major point for which I have not seen definitive conclusions is the storage of modified bases in a FASTA/FASTQ output (though I personally think the application of this data type should be quite limited, potentially not even supporting mapping). We are currently outputting these as alternative letters (for now these letters are arbitrary and even configurable by any user training a modified base model; see here for details). If there is an attempt to create a uniform "modified bases" alphabet it would be great to integrate this into the ONT toolchain to support wider adoption. |
Unfortunately I just haven't had time to work on this issue myself recently. However I wouldn't be the right person to think about modifications in VCF anyway. FASTA/FASTQ is challenging long term as the limited alphabet is going to cause problems at some point and there are no headers associated with those file formats to act as a dictionary. Are your default letters compatible with https://dnamod.hoffmanlab.org/? If so perhaps we have an emerging consensus for FAST[AQ]. |
Ok, current thinking is tags like:
This string is
Ie the above has the 28th G modified to "o". It has the 23rd C modified to m, along with the 37th, 39th, etc. Finally is has the 234th C modified to Ch. (These may all be +1 actually, I think I count from zero..) Note we permit "N" as the unmodified base to count on which means it's the nth base rather than the nth "N" base. That should capture awkward scenarios. Note we can encode modifications on the opposite strand too. Eg:
This is saying the 189th G in the sequence has a modification "h" (which applies to C so it's inherently the other strand anyway) on the opposite strand. This permits us to record modifications on either strand, either all for one strand or all for the other (single stranded) or for genuine double stranded sensing, including mods at the same site (vanishingly rare I expect). Finally, the codes m, h, f, c, o, etc could be numerical ChEBI values. As they only occur once per code type it won't bloat the file significantly. PS. The reason we do delta vs the last base of type Edit: this doesn't mean we're wedded to specific one letter codes. We should probably adopt the ones currently in use and recommended (see above), but maybe there should be some way (header?) of adding assigning new codes via ChEBI values. However that can come later as it'll be an easy extension. (Similarly also for a quality score tag, should they be required.) |
Here's some horrid perl code to experiment with base modifications. add_meth2.pl method file.sam > new_file.sam Methods are numbers:
An example usage:
Code is here: https://gist.github.com/jkbonfield/bb600c57e710f60cbfd79407230c4a4c Note the Example content from tags:
So - pick it apart! What's wrong with it? Bound to be something :-) Note: I know it's not hugely trivial to work with, but once we have a decent API that becomes irrelevant and the perl code demonstrates it's not overly problematic. |
Nice to see some code/prototyping in this ticket. In the Dotty world, the aligner is given a pure ATCG(N) sequence string, and the SAM output is post-processed to push the remaining information in the FASTQ into the SAM tags area (as Dotty e.g. Bismark or other tagging methods, e.g. bwameth). Would this be the same procedure for the option (3), MM tags? |
Good question, but it's kind of not our problem. This group is focused on providing ways of storing the data in SAM/BAM/CRAM and VCF/BCF. We don't support FASTA/FASTQ within this GA4GH team. Maybe that's a mistake, but our current philosophy is to encourage people to use unaligned SAM (etc) instead of FASTA/FASTQ as it's so much more flexible and has file headers too. Unofficially though, there are two possibilities I guess.
Either way, it needs a compliant aligner. |
I think I understand: we currently would carry the information in the Is my understanding correct @jkbonfield ? Having re-read this ticket, I think I understand why people suggest this could be done at the VCF/BCF level (as @marcus1487 suggests), given that some of this mod-calling has similarities to the pileup-style way of calling variants from multiple aligned reads from a SAM. I guess it's a theoretical question as to which one is preferred: SAM with mod-calls (or per-read mod-info) in the MM-tags, or VCF/BCF with mod-calls from a BAM file that contains aligned reads (with per-read mod-info). |
Yes that wrapper strategy should work. As for SAM vs VCF, I don't see this as either-or, but more of an "and". With Fastq -> SAM -> VCF the same base modification data needs to be represented in each and migrated as appropriate. Although for VCF it's likely we'll be working in counts or consensus and ambiguity code level. That's someone elses problem through ;-) Edit: ie we can keep this as discussion, but ultimately we need separate PRs for SAMtags and VCF specs for the actual definitions. |
Thanks for the explanation. It does sound like there are separate PRs:
Then tools to convert from modVCF to bedMethyl or bedGraph or other formats. |
Have you (or anyone else you know) tried extending the pileup API, or prototyped any other functionality, for htslib to be able to provide access to this data in a convenient fashion? I can see that a common question would be, "what's the probability that the base at reference position i of alignment A modified to e.g. mC?". Further to aggregate those numbers over reads. |
Sorry, not yet, but we need to do it. This is getting off-topic for hts-specs being more htslib, but do you have a suggestion for what the API should look like? Pileup is pretty crufty as an interface as the structures are public and non-changeable without breaking ABI. However there's room to exploit the "pileup_cd" member perhaps although it wasn't what it was intended for. The alternative is a function that given a bam structure an a query-position (qpos) within it, returns the relative modification possibilities. It'd ideally need to cache state that between calls to avoid repeated linear scanning, and again there's no obvious way to do that. The htslib pileup interface is simply crufty. I'm open to ideas though! |
Without dragging this too far off topic, your comments are in line with my own ideas. I do think something like the pileup API is warranted such that just as we can access bases from alignments and their qualities at a reference position we can get modified bases information. If you want an example use case to orient ideas see fast5mod which first converts Guppy's Fast5 data tables into sam with the mod-base scores in array-valued tags, and then builds a per reference position data table. I should say that my initial question here was motivated by the fact that the Guppy basecaller will soon be outputting data in the form described in #418. |
An aside and just a FYI (and not really hts-specs): I've started on adding these to htslib. First thing first is a general "I have a read, show me the modifications" style interface. Once that's done I'll consider whether we can integrate it somewhere into pileup. |
Closing as this was solved in #418, but we may need a new issue / PR for tracking the inevitable next step of representing these in VCF. |
This is just an issue to track progress on the idea.
The basic proposal so far is some form of side-channel in a SAM/BAM/CRAM tag which is used in conjunction with the sequence field to characterise the base modification. The two combined together form the actual modification. This works around issues with BAM having a limited (and already defined) 16 character set and also potentially problems with more than 20 different base modifications.
The likely solutions for this will either be an MD style tag with a mixture of lengths and modification symbols, or a BQ style tag with the same length as the sequence.
Some futher experiments and discussions can be found here:
https://docs.google.com/document/d/1LNVELOMU4ZlKASPoWxuUtgk7vtOZsU-ogERa_IeMeiI/edit?usp=sharing
See also the table in Appendix A (last page) of https://www.biorxiv.org/content/early/2016/03/15/043794
The text was updated successfully, but these errors were encountered: