This repository has been archived by the owner on Jan 13, 2022. It is now read-only.

Methylation in SAM #11

Open
jkbonfield opened this issue Nov 29, 2018 · 3 comments

@jkbonfield

I started investigating how to store such data in SAM. It's non-trivial due to BAM's choice of using nibbles to store sequences, which means uppercase bases plus ambiguity codes are the only choices available.

samtools/hts-specs#362

The actual storage method hasn't been finalised yet, and frankly I'm in the deep end with this anyway, so it's a learning experience for me too. However, I have done some experiments, both actual data compression and thought-based, on which methods work well.

One key thing to consider is that applications may not be methylation-aware. If we are doing sequence alignment or homology computation, we don't want to treat methylated C as an N for scoring purposes, but you can bet this is what 99% of software will do. Therefore a side channel, with the "fundamental" base in the SAM SEQuence field and the modification status in an auxiliary tag, looks like the sensible way to go.
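
To sketch what I mean in code (the "Xm" tag name and its encoding below are placeholders I've invented purely for illustration, not part of any hts-specs proposal), the canonical bases live in SEQ and the modification calls ride along in an aux tag, e.g. with pysam:

    import pysam

    # Minimal header with a single reference sequence.
    header = pysam.AlignmentHeader.from_dict({
        "HD": {"VN": "1.6"},
        "SQ": [{"SN": "chr1", "LN": 1000000}],
    })

    a = pysam.AlignedSegment(header)
    a.query_name = "read1"
    a.query_sequence = "ACGTCGTACC"   # canonical bases only, no 'Z' for 5mC
    a.flag = 0
    a.reference_id = 0                # chr1
    a.reference_start = 100
    a.mapping_quality = 60
    a.cigarstring = "10M"
    a.query_qualities = pysam.qualitystring_to_array("IIIIIIIIII")

    # Placeholder side-channel tag: modification type, 0-based read position
    # and probability for each call.  The real encoding is still being debated.
    a.set_tag("Xm", "5mC,4,0.85;5mC,9,0.62", value_type="Z")

    with pysam.AlignmentFile("modified.bam", "wb", header=header) as out:
        out.write(a)

A methylation-unaware aligner or caller reading that file just sees ordinary A/C/G/T; only tools that know to look for the tag need to interpret it.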

This is the time for the community to be making suggestions so we can come up with an appropriate hts-specs update.

@tmassingham-ont
Contributor

Hello. Thanks for the information. We had a look around to see if there was any standard we could follow before making an arbitrary choice to write out 5mC as 'Z' -- ostensibly because it was the last letter of the alphabet, though I suspect a joke based on homophones from our American friends. Flappie is a research prototype and this may not be the final form in which the information is presented since, as you rightly point out, a lot of downstream tooling will break. That includes tools we rely on to support our internal work (and many thanks to yourself and everyone who has developed such excellent software over the years), so we will directly experience this pain as we move direct modified basecalling from research into our products.

Representing modifications as unique letters also won't scale to future developments. DNA only has a few common modifications, but RNA is much more diverse: there are more than a hundred entries in the RNA modification database. One way forward would be to output a sequence of canonical bases alongside a separate channel with modification information and associated confidence.

We welcome any advice and feedback from the community about how to represent this data -- where do you think would be the best place to start a discussion? For now I'll update the README to make sure users of Flappie are at least aware of the issues with current and forward compatibility.

@jkbonfield
Author

Somewhat tardily, there is now a proposal for storing base modifications in SAM, BAM and CRAM:

samtools/hts-specs#418

Your thoughts on this would be appreciated. Have we forgotten something important? Is it too complex? (See the discussion)

@jkbonfield
Author

I'm not sure this is the correct place to comment, as Guppy is now the tool producing methylation calls rather than Flappie, but I couldn't find Guppy on GitHub.

Anyway I've now seen Guppy data and it raises interesting questions.

  1. It could potentially produce more than one modification type, e.g. columns for pure C, 5mC and 5hmC. In my SAM proposal we basically emit the most likely call. If that's straight C we don't have to record modification probabilities (or at least not if they fall below a certain threshold).

  2. The scale in Guppy is p*255, thus p=0.33 is 84. The scale used throughout the rest of SAM is Phred: -10 * log10(1-p). This works, OK-ish even for lower p values between, say, 0.5 and 0.9, but it's tragic for p<0.5.

  3. Is it useful to know the 2nd most likely call at any site? There have been times I've wanted this for straight consensus calling or SNP calling (call A with 0.9, but where is that 0.1 remainder distributed? It can really matter!). If so, how much?

  4. How much does fidelity really matter at mid-range probabilities? Do we care about the difference between p=0.4 and p=0.6, for example, or do we only care about, say, p=0.8 and above (and correspondingly p=0.2 and below for non-events)?

Obviously if we train on both 5mC and 5hmC then one of those values will have p<0.5. With the 0-255 range that's fine: e.g. 5mC at 0.66 and 5hmC at 0.33 still works. In the Phred scale it simply doesn't work. Illumina's earlier base caller used phred+64 as they wanted to emit all 4 confidence values. This was done with a log-odds scale, score = 10 * log10(p/(1-p)), instead of the Phred -10 * log10(1-p). It works well, but still has low fidelity for the mid ranges.
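
To make the scale comparison concrete, here's a throwaway Python sketch (my own illustration, nothing normative) comparing the three encodings: Guppy's linear p*255, the Phred-style -10*log10(1-p), and the log-odds 10*log10(p/(1-p)) used by that earlier Illumina caller:

    import math

    def phred(p):
        # Phred-style: -10*log10(1-p).  Everything below p=0.5 squashes into 0-3.
        return -10 * math.log10(1 - p)

    def log_odds(p):
        # Log-odds: 10*log10(p/(1-p)).  Symmetric about p=0.5, coarse mid-range.
        return 10 * math.log10(p / (1 - p))

    def linear255(p):
        # Guppy-style linear scaling to a single byte.
        return round(p * 255)

    for p in (0.10, 0.33, 0.50, 0.66, 0.90, 0.99):
        print(f"p={p:.2f}  phred={phred(p):6.2f}  "
              f"log-odds={log_odds(p):7.2f}  linear={linear255(p):3d}")

Running it shows p=0.10 and p=0.33 end up barely more than one Phred unit apart, while the linear byte keeps them a healthy 58 apart (26 vs 84).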

Ultimately no one ended up using the Illumina data, as people were still wedded to FASTQ and just cursed Illumina for producing yet another FASTQ variant! (Actually two, which was even worse.) However, that doesn't mean it wasn't a sensible move.

Please comment on the hts-specs GitHub issue (samtools/hts-specs#418), or forward this on to the correct group if this isn't the right place to report it.
