Support for new modified base tag #1123

marcus1487 · 2022-07-22T20:37:37Z

The modified base tags (MM/ML) were updated not too long ago and this new tag format is breaking the current implementation. The update adds a . or ? to the end of the MM tag description (see section 1.7 in sam tags spec). To copy the text here the spec, the new tag is formatted as MM:Z:([ACGTUN][-+]([a-z]+|[0-9]+)[.?]?(,[0-9]+)*;)*. Note the additional [.?] in this as opposed to the original version. It seems that this should result in an update to this expected pattern in the code.

The text was updated successfully, but these errors were encountered:

marcus1487 · 2022-10-04T20:02:20Z

Is there any update on support for this feature. The interface in pysam to support the modified base tag is very useful, but the lack of this support makes it very hard to use going forward.

starsyi · 2023-10-24T03:05:19Z

Is there any update on support for this feature？

marcus1487 · 2023-11-09T19:15:43Z

The lack of support for this is causing invalid results for many users attempting to parse modified base tags with pysam. Specifically with the implicit . mode highly confident canonincal calls are dropped from the modified_bases output and there is no indication that these have been dropped. This is resulting in very high observed methylation rates when this is not the case in the sample. And there is no alternative for users but to parse these tags themselves. Without support for this we will have to drop pysam usage for modified base analysis. I appreciate that this is a somewhat complex issue (e.g. what probability should be returned to the user when they have been omitted/included only implicitly?), but a fix of some sort would be much better than the current state of the API.

jmarshall · 2023-11-09T20:51:56Z

The “expected pattern” you pointed to in your initial comment is in HTSlib code. This code was updated to allow the new [.?] flags in htslib 1.16, which was imported in pysam 0.20.0 released in October 2022. So since then pysam should be accepting such syntax.

The question then would be whether the pysam AlignedSegment.modified_bases functionality exposes those flags in a useful way. As someone using this functionality, you would be best placed to say what the desired functionality for this API would be. Such details would be a useful contribution.

Perhaps you or your employer are offering to sponsor the development of improved pysam basemod functionality. If that is not the case, I am having difficulty interpreting parts of your comment — notably Without support for this we will have to drop pysam usage for modified base analysis — as anything other than the sort of unkind vaguely threatening off-topic comment that causes maintainer demotivation and burnout and will eventually lead to you being banned from this repository.

marcus1487 · 2023-11-09T22:29:30Z

I apologizes if this was interpreted as a threat in any way as this was not the intent. I truly appreciate all the hard work that goes into maintaining community tools and do not wish to diminish this effort in any way. We would certainly be happy to look into contributing to the development of this feature.

To the issue at hand, I understand that the tag with these specifiers is parsed and does not error given these new format tags. The issue is the returned modified base probabilities. From the SAM specification "When this flag is not present, or it is ‘.’, these bases should be assumed to have low probability of modification." The current implementation in htslib and pysam returns no entry for these skipped positions when the specification states that these positions should be assumed to have low probability of modification. The question now becomes how to implement this assumption. I would personally be happy for these probabilities to be returned as 0 (or equivalent to the 0 bin value where applicable). It would be good to discuss issues that other users or maintainers might have with this choice. Would adding this feature require a new attribute/array of the return value indicating where probabilities were inferred in the returned array? Would a dedicated choice aside from the 0 probability value make more sense? Once we've made these design decisions we can look into how best to look into implement this feature.

cjw85 · 2023-11-10T14:23:10Z

See previous htslib PR here:
samtools/htslib#1418 (comment)

We (the nebulous conglomerate of @jts, @jkbonfield and I) left the matter at the API not being completely broken, but did not have time to come back and make the API useful in the sense of conveying more information about implied calls.

jmarshall · 2023-11-10T21:42:25Z

In that case, that may mean that establishing conventions in this area is more an htslib issue than a pysam issue.

hts_base_mod_state is an opaque type, so no-one will be inspecting state->implicit directly. It is however reported by bam_mods_query_type() and bam_mods_queryi(), which pysam could use.

e-sollier mentioned this issue Jul 18, 2023

BAM file conversion PMBio/MetH5Format#2

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Support for new modified base tag #1123

Support for new modified base tag #1123

marcus1487 commented Jul 22, 2022

marcus1487 commented Oct 4, 2022

starsyi commented Oct 24, 2023

marcus1487 commented Nov 9, 2023

jmarshall commented Nov 9, 2023 •

edited

Loading

marcus1487 commented Nov 9, 2023

cjw85 commented Nov 10, 2023

jmarshall commented Nov 10, 2023

Support for new modified base tag #1123

Support for new modified base tag #1123

Comments

marcus1487 commented Jul 22, 2022

marcus1487 commented Oct 4, 2022

starsyi commented Oct 24, 2023

marcus1487 commented Nov 9, 2023

jmarshall commented Nov 9, 2023 • edited Loading

marcus1487 commented Nov 9, 2023

cjw85 commented Nov 10, 2023

jmarshall commented Nov 10, 2023

jmarshall commented Nov 9, 2023 •

edited

Loading