Methylation in SAM #11
Hello. Thanks for the information. We had a look around to see if there was any standard we could follow before making an arbitrary choice to write out 5mC as 'Z' -- ostensibly because it was the last letter of the alphabet but I suspect a joke based on homophones from our American friends. Flappie is a research prototype and this may not be the final form in which the information is presented since, as you rightly point out, a lot of downstream tooling will break. This includes tools that we rely on to support our internal work (and many thanks to yourself and everyone who has developed such excellent software over the years), so we will directly experience this pain as we move direct modified basecalling from research into our products. Representing modifications as unique letters also won't scale to future developments. DNA only has a few common modifications but RNA is much more diverse, there being more than a hundred entries in the RNA modification database. One way forward would be to output a sequence of canonical bases and a separate channel with modification information and associated confidence. We welcome any advice and feedback from the community about how to represent this data -- where do you think would be the best place to start a discussion? For now I'll update the README to make sure users of Flappie are at least aware of the issues with current and forwards compatibility.
Somewhat tardily, there is now a proposal for storing base modifications in SAM, BAM and CRAM: Your thoughts on this would be appreciated. Have we forgotten something important? Is it too complex? (See the discussion)
I'm not sure this is the correct place to comment, as Guppy is now the tool producing methylation calls and not Flappie, but I couldn't find Guppy on GitHub. Anyway I've now seen Guppy data and it raises interesting questions.
Obviously if we train on both 5mC and 5hmC then one of those values will have p<0.5. With the 0-255 range that's fine. E.g. a 5mC being 0.66 and 5hmC = 0.33 still works fine. In the phred scale it simply doesn't work. Illumina's earlier base caller used phred+64 as they wanted to emit all 4 confidence values. This was done with log-odds scales, where score = 10*log10(p/(1-p)), instead of the phred scale's -10*log10(1-p). It works well, but still has low fidelity for the mid ranges. Ultimately no one ended up using the Illumina data as people were still wedded to FASTQ and just cursed Illumina for producing yet another FASTQ variant! (Actually two, which was even worse.) However that doesn't mean it wasn't a sensible move. Please comment on the hts-specs GitHub issue (samtools/hts-specs#418), or forward this on to the correct group if it's not reported correctly here.
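To make the comparison of scales concrete, here is a rough sketch that encodes the 0.66 / 0.33 example on each one. The function names are made up for illustration, and the linear mapping `int(p * 256)` is an assumption about how a 0-255 range might be filled, not something stated in this thread. Inputs are assumed to lie strictly between 0 and 1.

```python
import math

def to_byte(p):
    """Linear 0-255 scale (assumed mapping: floor(p * 256), capped at 255)."""
    return min(255, int(p * 256))

def to_phred(p):
    """Phred scale as written above: -10*log10(1-p), rounded to an integer."""
    return round(-10 * math.log10(1 - p))

def to_log_odds(p):
    """Log-odds scale as written above: 10*log10(p/(1-p)), rounded."""
    return round(10 * math.log10(p / (1 - p)))

for name, p in [("5mC", 0.66), ("5hmC", 0.33)]:
    print(name, p, to_byte(p), to_phred(p), to_log_odds(p))

# How many distinct integer codes does each scale have for p < 0.5?
low_half = [i / 256 for i in range(128)]
print(len({to_byte(p) for p in low_half}))   # 128 distinct values
print(len({to_phred(p) for p in low_half}))  # 4 distinct values (0..3)
```

With integer phred values, every probability below 0.5 collapses into the codes 0 to 3, which is why the lower-probability call of a 5mC/5hmC pair is poorly represented there, while the linear byte scale keeps full resolution.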
I started investigations into how to store such data in SAM. It's non-trivial due to BAM's choice of using nibbles to store sequences, which means uppercase bases + ambiguity codes are the only choices available.
samtools/hts-specs#362
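As a rough illustration of the nibble constraint, the sketch below packs a sequence the way the BAM format does: each base becomes a 4-bit index into the table `=ACMGRSVTWYHKDBN`, high nibble first, with any other character mapped to `N`. The helper names `pack_seq` and `unpack_seq` are hypothetical and this is not htslib code.

```python
# 4-bit code table from the BAM format: index 0..15.
BAM_CODES = "=ACMGRSVTWYHKDBN"
N_CODE = BAM_CODES.index("N")  # 15

def pack_seq(seq):
    """Pack bases two per byte, high nibble first; unknown letters become N."""
    codes = [BAM_CODES.index(b) if b in BAM_CODES else N_CODE
             for b in seq.upper()]
    if len(codes) % 2:
        codes.append(0)  # pad the final low nibble
    return bytes((hi << 4) | lo for hi, lo in zip(codes[0::2], codes[1::2]))

def unpack_seq(data, length):
    """Reverse of pack_seq; 'length' drops the padding nibble if present."""
    bases = []
    for byte in data:
        bases.append(BAM_CODES[byte >> 4])
        bases.append(BAM_CODES[byte & 0x0F])
    return "".join(bases[:length])

# A 'Z' (Flappie's 5mC letter) does not survive the round trip.
print(unpack_seq(pack_seq("ACZT"), 4))  # ACNT
```

All sixteen codes are already taken, so there is no spare slot for a modified-base letter in the packed sequence.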
The actual storage method hasn't been finalised yet, and frankly I'm in the deep end with this anyway so it's a learning experience for me too. However I have done some experiments, both with actual data compression and as thought experiments, on which methods work well.
One key thing to consider is that applications may not be methylation aware. If we are doing sequence alignment or homology computation, we don't want to treat methylated C as an N for scoring purposes, but you can bet this is what 99% of software will do. Therefore a side channel, with the "fundamental" base in the SAM SEQuence field and the modification status in an auxiliary tag, looks like the sensible way to go.
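A minimal sketch of the side-channel idea, assuming Flappie-style input where 'Z' stands for 5mC: the letter is replaced by its canonical base, and the modification is recorded separately by position. The function name, the mapping table and the `(position, name)` output layout are all hypothetical; how this would actually be serialised into an auxiliary tag is exactly what remains to be decided.

```python
# Hypothetical mapping: modified-base letter -> (canonical base, modification).
MOD_LETTERS = {"Z": ("C", "5mC")}

def split_modifications(seq):
    """Return (canonical sequence, list of (0-based position, modification))."""
    canonical = []
    mods = []
    for pos, base in enumerate(seq):
        if base in MOD_LETTERS:
            canon, name = MOD_LETTERS[base]
            canonical.append(canon)
            mods.append((pos, name))
        else:
            canonical.append(base)
    return "".join(canonical), mods

seq, mods = split_modifications("ACZGZT")
print(seq)   # ACCGCT
print(mods)  # [(2, '5mC'), (4, '5mC')]
```

A tool that knows nothing about methylation sees only `ACCGCT` and scores it as ordinary sequence, while a modification-aware tool can read the second channel.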
This is the time for the community to be making suggestions so we can come up with an appropriate hts-specs update.