Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support for new modified base tag #1123

Open
marcus1487 opened this issue Jul 22, 2022 · 7 comments
Open

Support for new modified base tag #1123

marcus1487 opened this issue Jul 22, 2022 · 7 comments

Comments

@marcus1487
Copy link

The modified base tags (MM/ML) were updated not too long ago and this new tag format is breaking the current implementation. The update adds a . or ? to the end of the MM tag description (see section 1.7 in sam tags spec). To copy the text here the spec, the new tag is formatted as MM:Z:([ACGTUN][-+]([a-z]+|[0-9]+)[.?]?(,[0-9]+)*;)*. Note the additional [.?] in this as opposed to the original version. It seems that this should result in an update to this expected pattern in the code.

@marcus1487
Copy link
Author

Is there any update on support for this feature. The interface in pysam to support the modified base tag is very useful, but the lack of this support makes it very hard to use going forward.

@starsyi
Copy link

starsyi commented Oct 24, 2023

Is there any update on support for this feature?

@marcus1487
Copy link
Author

The lack of support for this is causing invalid results for many users attempting to parse modified base tags with pysam. Specifically with the implicit . mode highly confident canonincal calls are dropped from the modified_bases output and there is no indication that these have been dropped. This is resulting in very high observed methylation rates when this is not the case in the sample. And there is no alternative for users but to parse these tags themselves. Without support for this we will have to drop pysam usage for modified base analysis. I appreciate that this is a somewhat complex issue (e.g. what probability should be returned to the user when they have been omitted/included only implicitly?), but a fix of some sort would be much better than the current state of the API.

@jmarshall
Copy link
Member

jmarshall commented Nov 9, 2023

The “expected pattern” you pointed to in your initial comment is in HTSlib code. This code was updated to allow the new [.?] flags in htslib 1.16, which was imported in pysam 0.20.0 released in October 2022. So since then pysam should be accepting such syntax.

The question then would be whether the pysam AlignedSegment.modified_bases functionality exposes those flags in a useful way. As someone using this functionality, you would be best placed to say what the desired functionality for this API would be. Such details would be a useful contribution.

Perhaps you or your employer are offering to sponsor the development of improved pysam basemod functionality. If that is not the case, I am having difficulty interpreting parts of your comment — notably Without support for this we will have to drop pysam usage for modified base analysis — as anything other than the sort of unkind vaguely threatening off-topic comment that causes maintainer demotivation and burnout and will eventually lead to you being banned from this repository.

@marcus1487
Copy link
Author

I apologizes if this was interpreted as a threat in any way as this was not the intent. I truly appreciate all the hard work that goes into maintaining community tools and do not wish to diminish this effort in any way. We would certainly be happy to look into contributing to the development of this feature.

To the issue at hand, I understand that the tag with these specifiers is parsed and does not error given these new format tags. The issue is the returned modified base probabilities. From the SAM specification "When this flag is not present, or it is ‘.’, these bases should be assumed to have low probability of modification." The current implementation in htslib and pysam returns no entry for these skipped positions when the specification states that these positions should be assumed to have low probability of modification. The question now becomes how to implement this assumption. I would personally be happy for these probabilities to be returned as 0 (or equivalent to the 0 bin value where applicable). It would be good to discuss issues that other users or maintainers might have with this choice. Would adding this feature require a new attribute/array of the return value indicating where probabilities were inferred in the returned array? Would a dedicated choice aside from the 0 probability value make more sense? Once we've made these design decisions we can look into how best to look into implement this feature.

@cjw85
Copy link

cjw85 commented Nov 10, 2023

See previous htslib PR here:
samtools/htslib#1418 (comment)

We (the nebulous conglomerate of @jts, @jkbonfield and I) left the matter at the API not being completely broken, but did not have time to come back and make the API useful in the sense of conveying more information about implied calls.

@jmarshall
Copy link
Member

In that case, that may mean that establishing conventions in this area is more an htslib issue than a pysam issue.

hts_base_mod_state is an opaque type, so no-one will be inspecting state->implicit directly. It is however reported by bam_mods_query_type() and bam_mods_queryi(), which pysam could use.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants