Fixing reference position from offset implementation #40

arahuja · 2014-01-08T23:11:06Z

Two parts here that I wanted to add:

Reference position from offset computation was sometimes mapping two different offsets to the same reference position which shouldn't be possible. i.e. if start= 1 and cigar=1S1M we were getting both mapped to 0 as opposed to 0 and 1. Correct me if I'm wrong on my understanding of the cigar.
Secondly, end and unclippedEnd are exclusive (pointing to 1 past the last read base) currently, I wanted to know if this was intended? It seemed odd to me that that the final reference position in referencePositions was not the same as end. I think we had started a conversation on reference coordinate system earlier, so curious to know if end is purposely exclusive?

I had changed it and then changed back ...

AmplabJenkins · 2014-01-08T23:18:31Z

All automated tests passed.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/ADAM-prb/15/

fnothaft · 2014-01-08T23:51:19Z

adam-core/src/test/scala/edu/berkeley/cs/amplab/adam/rich/RichADAMRecordSuite.scala

@@ -54,4 +67,61 @@ class RichADAMRecordSuite extends FunSuite {
    }
  }

+  test("Cigar Clipping Sequence") {
+//    val hardClippedRead = ADAMRecord.newBuilder().setReadMapped(true).setStart(100).setCigar("10H90M").build()
+//    assert( hardClippedRead.referencePositions(0).get == 100)


If these lines are unused, can we remove them?

fnothaft · 2014-01-08T23:58:10Z

Thanks for the pull request! I've put a few comments inline about the code. Also, can you please rebase down to a single commit?

As for the cigar, that's a good question and is definitely somewhat ambiguous. 1S1M means that the first base is soft clipped, and then the second base is an alignment match. Clipped bases (both soft and hard) are not considered as consuming reference bases—this is not intuitive. They are considered as not consuming reference bases because they are considered as having been discarded. Therefore, realistically the reference position for a clipped base is not defined—in our case, we consider the reference position to be the position of the start of the read.

I'm not sure about the exclusive end—that would be a better question for Matt.

arahuja · 2014-01-09T00:47:06Z

Frank,

Thanks for the comments, I will pull out the commented lines and rebase, but I wanted to hear about the end value first and those tests would pass if end was inclusive.

Clipping aside, the error still exists(ed). Right now soft clipped bases map to some reference position relative to the start, and hard clipped bases are not included. But that aside, a better example of the issue maybe: start = 1, cigar = 1M1I1M With this, refPos(0) = 1, refPos(1) = None and refPos(2) = 0 still, when refPos(2) should be 1. This fixes that.

massie · 2014-01-09T23:57:26Z

Note: The Scala Range object is by default exclusive. To make it inclusive, the Range.inclusive factory method is used instead.

The following table shows all the CIGAR operators and whether they consume read and/or reference bases.

The SAM spec says that the "Sum of lengths of the M/I/S/=/X operations shall equal the length of SEQ" which is another way of saying that those operators consume read bases.

Looking at your example,

start = 1, cigar = 1M1I1M

The 'M' CIGAR operator consumes reference bases whereas the 'I' operator does not. So the return should be [1, 2, 2], no? This assumes that inserts come before any matches at a position. As you mentioned, this may be a discussion for the broader team re: coordinate systems.

The length of the sequence for this example should be 3 since both the operators 'M' and 'I' consume read bases.

fnothaft · 2014-01-10T00:01:00Z

I believe inserts are normally conceptualized as being left-associative, i.e., coming after a base. I would expect [1, 1, 2].

massie · 2014-01-10T00:04:50Z

That's fine by me as long as we're consistent.

Right now, Arun's example returns

List(Some(1), None, Some(1))

on master right now which is incorrect.

arahuja · 2014-01-10T05:37:35Z

Yes, there is plain old bug that makes that example return [1, None, 1] and this change would make it [1, None, 2], hence all the test. RIght now the referencePositions is only used to lookup the reference base as a mismatch (in the md tag and shortly in an upcoming PR the reference sequence), therefore insertions return None since they shouldn't be checked against the reference. But we could make a new convention that fits either of the ones you mentioned.

AmplabJenkins · 2014-01-10T05:49:11Z

One or more automated tests failed
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/ADAM-prb/16/

fnothaft · 2014-01-10T06:10:32Z

It should return [Some(1), Some(1), Some(2)], nyet?

massie · 2014-01-10T08:46:18Z

Jenkins test this please.

AmplabJenkins · 2014-01-10T08:49:11Z

One or more automated tests failed
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/ADAM-prb/17/

arahuja · 2014-01-10T14:13:42Z

We could change the insertions to be the last reference base as you said, but that'd break the current BQSR implementation since that relies on insertions returning None.

massie · 2014-01-10T21:51:20Z

@arahuja There's no need to change this. Agree that your [1, None, 2] fix is good enough for this pull request.

I'm trying to fix the pull request builder right now to test it and expect to merge it soon. Stay tuned.

massie · 2014-01-10T22:21:20Z

Jenkins, test this please.

AmplabJenkins · 2014-01-10T22:28:16Z

All automated tests passed.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/ADAM-prb/19/

massie · 2014-01-10T22:52:47Z

adam-core/src/test/scala/edu/berkeley/cs/amplab/adam/rich/RichADAMRecordSuite.scala

+
+    val hardClippedRead = ADAMRecord.newBuilder().setReadMapped(true).setStart(1000).setCigar("90M10H").build()
+    assert( hardClippedRead.referencePositions.length == 90)
+    assert( hardClippedRead.referencePositions(0) == Some(1000))


You are comparing a Option[Long] to an Option[Integer] with each of these. I would recommend that you do either,

foo.referencePosition(0) == Some(1000L)

or

foo.referencePositions(0).get == 1000

massie · 2014-01-10T22:54:23Z

Aside from some minor test cleanup to compare the same types, I think this is ready to merge. I love seeing all the tests! 👍

…rror

AmplabJenkins · 2014-01-10T23:13:04Z

All automated tests passed.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/ADAM-prb/21/

Fixing reference position from offset implementation

massie · 2014-01-10T23:16:25Z

Thanks, Arun!

fnothaft reviewed Jan 8, 2014
View reviewed changes

massie reviewed Jan 10, 2014
View reviewed changes

Add tests for reference position and fix refpos position off-by-one-e…

df16e01

…rror

massie added a commit that referenced this pull request Jan 10, 2014

Merge pull request #40 from arahuja/referenceend

1977ba4

Fixing reference position from offset implementation

massie merged commit 1977ba4 into bigdatagenomics:master Jan 10, 2014

arahuja deleted the referenceend branch February 23, 2014 03:04

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fixing reference position from offset implementation #40

Fixing reference position from offset implementation #40

arahuja commented Jan 8, 2014

AmplabJenkins commented Jan 8, 2014

fnothaft Jan 8, 2014

fnothaft commented Jan 8, 2014

arahuja commented Jan 9, 2014

massie commented Jan 9, 2014

fnothaft commented Jan 10, 2014

massie commented Jan 10, 2014

arahuja commented Jan 10, 2014

AmplabJenkins commented Jan 10, 2014

fnothaft commented Jan 10, 2014

massie commented Jan 10, 2014

AmplabJenkins commented Jan 10, 2014

arahuja commented Jan 10, 2014

massie commented Jan 10, 2014

massie commented Jan 10, 2014

AmplabJenkins commented Jan 10, 2014

massie Jan 10, 2014

massie commented Jan 10, 2014

AmplabJenkins commented Jan 10, 2014

massie commented Jan 10, 2014

Fixing reference position from offset implementation #40

Fixing reference position from offset implementation #40

Conversation

arahuja commented Jan 8, 2014

AmplabJenkins commented Jan 8, 2014

fnothaft Jan 8, 2014

Choose a reason for hiding this comment

fnothaft commented Jan 8, 2014

arahuja commented Jan 9, 2014

massie commented Jan 9, 2014

fnothaft commented Jan 10, 2014

massie commented Jan 10, 2014

arahuja commented Jan 10, 2014

AmplabJenkins commented Jan 10, 2014

fnothaft commented Jan 10, 2014

massie commented Jan 10, 2014

AmplabJenkins commented Jan 10, 2014

arahuja commented Jan 10, 2014

massie commented Jan 10, 2014

massie commented Jan 10, 2014

AmplabJenkins commented Jan 10, 2014

massie Jan 10, 2014

Choose a reason for hiding this comment

massie commented Jan 10, 2014

AmplabJenkins commented Jan 10, 2014

massie commented Jan 10, 2014