Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fixing reference position from offset implementation #40

Merged
merged 1 commit into from
Jan 10, 2014

Conversation

arahuja
Copy link
Contributor

@arahuja arahuja commented Jan 8, 2014

Two parts here that I wanted to add:

  1. Reference position from offset computation was sometimes mapping two different offsets to the same reference position which shouldn't be possible. i.e. if start= 1 and cigar=1S1M we were getting both mapped to 0 as opposed to 0 and 1. Correct me if I'm wrong on my understanding of the cigar.

  2. Secondly, end and unclippedEnd are exclusive (pointing to 1 past the last read base) currently, I wanted to know if this was intended? It seemed odd to me that that the final reference position in referencePositions was not the same as end. I think we had started a conversation on reference coordinate system earlier, so curious to know if end is purposely exclusive?

I had changed it and then changed back ...

@AmplabJenkins
Copy link

All automated tests passed.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/ADAM-prb/15/

@@ -54,4 +67,61 @@ class RichADAMRecordSuite extends FunSuite {
}
}

test("Cigar Clipping Sequence") {
// val hardClippedRead = ADAMRecord.newBuilder().setReadMapped(true).setStart(100).setCigar("10H90M").build()
// assert( hardClippedRead.referencePositions(0).get == 100)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If these lines are unused, can we remove them?

@fnothaft
Copy link
Member

fnothaft commented Jan 8, 2014

Thanks for the pull request! I've put a few comments inline about the code. Also, can you please rebase down to a single commit?

As for the cigar, that's a good question and is definitely somewhat ambiguous. 1S1M means that the first base is soft clipped, and then the second base is an alignment match. Clipped bases (both soft and hard) are not considered as consuming reference bases—this is not intuitive. They are considered as not consuming reference bases because they are considered as having been discarded. Therefore, realistically the reference position for a clipped base is not defined—in our case, we consider the reference position to be the position of the start of the read.

I'm not sure about the exclusive end—that would be a better question for Matt.

@arahuja
Copy link
Contributor Author

arahuja commented Jan 9, 2014

Frank,

Thanks for the comments, I will pull out the commented lines and rebase, but I wanted to hear about the end value first and those tests would pass if end was inclusive.

Clipping aside, the error still exists(ed). Right now soft clipped bases map to some reference position relative to the start, and hard clipped bases are not included. But that aside, a better example of the issue maybe: start = 1, cigar = 1M1I1M With this, refPos(0) = 1, refPos(1) = None and refPos(2) = 0 still, when refPos(2) should be 1. This fixes that.

@massie
Copy link
Member

massie commented Jan 9, 2014

Note: The Scala Range object is by default exclusive. To make it inclusive, the Range.inclusive factory method is used instead.

The following table shows all the CIGAR operators and whether they consume read and/or reference bases.

screen shot 2014-01-09 at 3 37 46 pm

The SAM spec says that the "Sum of lengths of the M/I/S/=/X operations shall equal the length of SEQ" which is another way of saying that those operators consume read bases.

Looking at your example,

start = 1, cigar = 1M1I1M

The 'M' CIGAR operator consumes reference bases whereas the 'I' operator does not. So the return should be [1, 2, 2], no? This assumes that inserts come before any matches at a position. As you mentioned, this may be a discussion for the broader team re: coordinate systems.

The length of the sequence for this example should be 3 since both the operators 'M' and 'I' consume read bases.

@fnothaft
Copy link
Member

I believe inserts are normally conceptualized as being left-associative, i.e., coming after a base. I would expect [1, 1, 2].

@massie
Copy link
Member

massie commented Jan 10, 2014

That's fine by me as long as we're consistent.

Right now, Arun's example returns

List(Some(1), None, Some(1))

on master right now which is incorrect.

@arahuja
Copy link
Contributor Author

arahuja commented Jan 10, 2014

Yes, there is plain old bug that makes that example return [1, None, 1] and this change would make it [1, None, 2], hence all the test. RIght now the referencePositions is only used to lookup the reference base as a mismatch (in the md tag and shortly in an upcoming PR the reference sequence), therefore insertions return None since they shouldn't be checked against the reference. But we could make a new convention that fits either of the ones you mentioned.

@AmplabJenkins
Copy link

One or more automated tests failed
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/ADAM-prb/16/

@fnothaft
Copy link
Member

It should return [Some(1), Some(1), Some(2)], nyet?

@massie
Copy link
Member

massie commented Jan 10, 2014

Jenkins test this please.

@AmplabJenkins
Copy link

One or more automated tests failed
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/ADAM-prb/17/

@arahuja
Copy link
Contributor Author

arahuja commented Jan 10, 2014

We could change the insertions to be the last reference base as you said, but that'd break the current BQSR implementation since that relies on insertions returning None.

@massie
Copy link
Member

massie commented Jan 10, 2014

@arahuja There's no need to change this. Agree that your [1, None, 2] fix is good enough for this pull request.

I'm trying to fix the pull request builder right now to test it and expect to merge it soon. Stay tuned.

@massie
Copy link
Member

massie commented Jan 10, 2014

Jenkins, test this please.

@AmplabJenkins
Copy link

All automated tests passed.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/ADAM-prb/19/


val hardClippedRead = ADAMRecord.newBuilder().setReadMapped(true).setStart(1000).setCigar("90M10H").build()
assert( hardClippedRead.referencePositions.length == 90)
assert( hardClippedRead.referencePositions(0) == Some(1000))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are comparing a Option[Long] to an Option[Integer] with each of these. I would recommend that you do either,

foo.referencePosition(0) == Some(1000L) 

or

foo.referencePositions(0).get == 1000

@massie
Copy link
Member

massie commented Jan 10, 2014

Aside from some minor test cleanup to compare the same types, I think this is ready to merge. I love seeing all the tests! 👍

@AmplabJenkins
Copy link

All automated tests passed.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/ADAM-prb/21/

massie added a commit that referenced this pull request Jan 10, 2014
Fixing reference position from offset implementation
@massie massie merged commit 1977ba4 into bigdatagenomics:master Jan 10, 2014
@massie
Copy link
Member

massie commented Jan 10, 2014

Thanks, Arun!

@arahuja arahuja deleted the referenceend branch February 23, 2014 03:04
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants