-
Notifications
You must be signed in to change notification settings - Fork 308
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fixing reference position from offset implementation #40
Conversation
All automated tests passed. |
@@ -54,4 +67,61 @@ class RichADAMRecordSuite extends FunSuite { | |||
} | |||
} | |||
|
|||
test("Cigar Clipping Sequence") { | |||
// val hardClippedRead = ADAMRecord.newBuilder().setReadMapped(true).setStart(100).setCigar("10H90M").build() | |||
// assert( hardClippedRead.referencePositions(0).get == 100) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If these lines are unused, can we remove them?
Thanks for the pull request! I've put a few comments inline about the code. Also, can you please rebase down to a single commit? As for the cigar, that's a good question and is definitely somewhat ambiguous. 1S1M means that the first base is soft clipped, and then the second base is an alignment match. Clipped bases (both soft and hard) are not considered as consuming reference bases—this is not intuitive. They are considered as not consuming reference bases because they are considered as having been discarded. Therefore, realistically the reference position for a clipped base is not defined—in our case, we consider the reference position to be the position of the start of the read. I'm not sure about the exclusive end—that would be a better question for Matt. |
Frank, Thanks for the comments, I will pull out the commented lines and rebase, but I wanted to hear about the end value first and those tests would pass if end was inclusive. Clipping aside, the error still exists(ed). Right now soft clipped bases map to some reference position relative to the start, and hard clipped bases are not included. But that aside, a better example of the issue maybe: start = 1, cigar = 1M1I1M With this, refPos(0) = 1, refPos(1) = None and refPos(2) = 0 still, when refPos(2) should be 1. This fixes that. |
Note: The Scala Range object is by default exclusive. To make it inclusive, the The following table shows all the CIGAR operators and whether they consume read and/or reference bases. The SAM spec says that the "Sum of lengths of the M/I/S/=/X operations shall equal the length of SEQ" which is another way of saying that those operators consume read bases. Looking at your example,
The 'M' CIGAR operator consumes reference bases whereas the 'I' operator does not. So the return should be [1, 2, 2], no? This assumes that inserts come before any matches at a position. As you mentioned, this may be a discussion for the broader team re: coordinate systems. The length of the sequence for this example should be 3 since both the operators 'M' and 'I' consume read bases. |
I believe inserts are normally conceptualized as being left-associative, i.e., coming after a base. I would expect [1, 1, 2]. |
That's fine by me as long as we're consistent. Right now, Arun's example returns
on |
Yes, there is plain old bug that makes that example return [1, None, 1] and this change would make it [1, None, 2], hence all the test. RIght now the referencePositions is only used to lookup the reference base as a mismatch (in the md tag and shortly in an upcoming PR the reference sequence), therefore insertions return None since they shouldn't be checked against the reference. But we could make a new convention that fits either of the ones you mentioned. |
One or more automated tests failed |
It should return [Some(1), Some(1), Some(2)], nyet? |
Jenkins test this please. |
One or more automated tests failed |
We could change the insertions to be the last reference base as you said, but that'd break the current BQSR implementation since that relies on insertions returning None. |
@arahuja There's no need to change this. Agree that your [1, None, 2] fix is good enough for this pull request. I'm trying to fix the pull request builder right now to test it and expect to merge it soon. Stay tuned. |
Jenkins, test this please. |
All automated tests passed. |
|
||
val hardClippedRead = ADAMRecord.newBuilder().setReadMapped(true).setStart(1000).setCigar("90M10H").build() | ||
assert( hardClippedRead.referencePositions.length == 90) | ||
assert( hardClippedRead.referencePositions(0) == Some(1000)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are comparing a Option[Long] to an Option[Integer] with each of these. I would recommend that you do either,
foo.referencePosition(0) == Some(1000L)
or
foo.referencePositions(0).get == 1000
Aside from some minor test cleanup to compare the same types, I think this is ready to merge. I love seeing all the tests! 👍 |
All automated tests passed. |
Fixing reference position from offset implementation
Thanks, Arun! |
Two parts here that I wanted to add:
Reference position from offset computation was sometimes mapping two different offsets to the same reference position which shouldn't be possible. i.e. if start= 1 and cigar=1S1M we were getting both mapped to 0 as opposed to 0 and 1. Correct me if I'm wrong on my understanding of the cigar.
Secondly, end and unclippedEnd are exclusive (pointing to 1 past the last read base) currently, I wanted to know if this was intended? It seemed odd to me that that the final reference position in referencePositions was not the same as end. I think we had started a conversation on reference coordinate system earlier, so curious to know if end is purposely exclusive?
I had changed it and then changed back ...