[ADAM-646] Special case reads with '*' quality during BQSR. #647

fnothaft · 2015-04-09T17:10:26Z

Resolves #646. Allows the creation of DecadentReads with * quality scores. These reads are then not observed or corrected during BQSR.
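The behavior described above (reads without usable quality scores are neither observed nor corrected) might be sketched as follows. This is a simplified illustration, not ADAM's BQSR code: `SimpleRead`, `hasQuality`, and `recalibrate` are hypothetical names, and the read is reduced to a name and a raw quality string.

```scala
object SkipMissingQuality {
  // Hypothetical, simplified read: just a name and a raw quality string.
  final case class SimpleRead(name: String, qual: String)

  // A read has usable qualities unless the field is null or the SAM "*" placeholder.
  def hasQuality(r: SimpleRead): Boolean = r.qual != null && r.qual != "*"

  // Apply `correct` only to reads that carry qualities; pass the others through untouched.
  def recalibrate(reads: Seq[SimpleRead])(correct: SimpleRead => SimpleRead): Seq[SimpleRead] =
    reads.map(r => if (hasQuality(r)) correct(r) else r)
}
```

Reads for which `hasQuality` is false come out of `recalibrate` unchanged, in their original position.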

massie · 2015-04-09T17:24:09Z

This looks good, but should we just set qualityString to null if it's * and then just do a null check?

fnothaft · 2015-04-09T17:25:31Z

Are you suggesting to do that check in the SAM/BAM<->ADAM converters?

AmplabJenkins · 2015-04-09T17:26:07Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/670/

Build result: FAILURE

GitHub pull request #647 of commit f6ce721 automatically merged.
Notifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'
[EnvInject] - Loading node environment variables.
Building remotely on amp-jenkins-worker-05 (centos) in workspace /home/jenkins/workspace/ADAM-prb
 > git rev-parse --is-inside-work-tree # timeout=10
Fetching changes from the remote Git repository
 > git config remote.origin.url https://github.com/bigdatagenomics/adam.git # timeout=10
Fetching upstream changes from https://github.com/bigdatagenomics/adam.git
 > git --version # timeout=10
 > git fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/pull/:refs/remotes/origin/pr/
 > git rev-parse origin/pr/647/merge^{commit} # timeout=10
 > git branch -a --contains d7e55c115cfc9f4de7289144d2506ea006bf3237 # timeout=10
 > git rev-parse remotes/origin/pr/647/merge^{commit} # timeout=10
Checking out Revision d7e55c115cfc9f4de7289144d2506ea006bf3237 (origin/pr/647/merge)
 > git config core.sparsecheckout # timeout=10
 > git checkout -f d7e55c115cfc9f4de7289144d2506ea006bf3237
First time build. Skipping changelog.
Triggering ADAM-prb » 2.2.0,centos
Triggering ADAM-prb » 2.3.0,centos
Triggering ADAM-prb » 1.0.4,centos
ADAM-prb » 2.2.0,centos completed with result SUCCESS
ADAM-prb » 2.3.0,centos completed with result FAILURE
ADAM-prb » 1.0.4,centos completed with result SUCCESS
Notifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'
Test FAILed.

massie · 2015-04-09T17:29:40Z

Yes, when we convert from BAM to ADAM, simply set the qualityString to null if it's *. It will be more compact (we save two bytes for each read) and doesn't require any string comparisons (albeit the string isn't very long :)).
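A minimal sketch of the conversion-time normalization suggested above, assuming the SAM convention that `*` marks an absent quality string. `QualityNormalizer` and `normalizeQuality` are hypothetical names used for illustration; they are not part of ADAM's converter API.

```scala
object QualityNormalizer {
  // SAM uses "*" to mean "no base qualities stored for this read".
  val MissingQuality = "*"

  // Map the "*" placeholder (or an already-null value) to None so that
  // downstream code can test for absence instead of comparing strings.
  def normalizeQuality(rawQual: String): Option[String] =
    Option(rawQual).filter(_ != MissingQuality)
}
```

Calling `.orNull` on the result gives the null value that would be stored in the record's quality field.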

fnothaft · 2015-04-09T17:32:12Z

That's a good idea. Let me refactor that.

fnothaft · 2015-04-09T18:00:37Z

Updated with the null on conversion change.

AmplabJenkins · 2015-04-09T18:11:12Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/671/

Build result: FAILURE

GitHub pull request #647 of commit 608d5f7 automatically merged.
Notifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'
[EnvInject] - Loading node environment variables.
Building remotely on amp-jenkins-worker-05 (centos) in workspace /home/jenkins/workspace/ADAM-prb
 > git rev-parse --is-inside-work-tree # timeout=10
Fetching changes from the remote Git repository
 > git config remote.origin.url https://github.com/bigdatagenomics/adam.git # timeout=10
Fetching upstream changes from https://github.com/bigdatagenomics/adam.git
 > git --version # timeout=10
 > git fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/pull/:refs/remotes/origin/pr/
 > git rev-parse origin/pr/647/merge^{commit} # timeout=10
 > git branch -a --contains 356a6d6711a5a558e7df6c94cd0b427d82a69d58 # timeout=10
 > git rev-parse remotes/origin/pr/647/merge^{commit} # timeout=10
Checking out Revision 356a6d6711a5a558e7df6c94cd0b427d82a69d58 (origin/pr/647/merge)
 > git config core.sparsecheckout # timeout=10
 > git checkout -f 356a6d6711a5a558e7df6c94cd0b427d82a69d58
First time build. Skipping changelog.
Triggering ADAM-prb » 2.2.0,centos
Triggering ADAM-prb » 2.3.0,centos
Triggering ADAM-prb » 1.0.4,centos
ADAM-prb » 2.2.0,centos completed with result SUCCESS
ADAM-prb » 2.3.0,centos completed with result FAILURE
ADAM-prb » 1.0.4,centos completed with result SUCCESS
Notifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'
Test FAILed.

fnothaft · 2015-04-09T18:12:37Z

Jenkins, retest this please.

Looks like some issue with a JAR not being pulled down in the Hadoop 2.2 build.

massie · 2015-04-09T18:27:06Z

adam-core/src/main/scala/org/bdgenomics/adam/converters/AlignmentRecordConverter.scala

@@ -79,7 +79,7 @@ class AlignmentRecordConverter extends Serializable {
    // set canonically necessary fields
    builder.setReadName(adamRecord.getReadName.toString)
    builder.setReadString(adamRecord.getSequence)
-    builder.setBaseQualityString(adamRecord.getQual)
+    Option(adamRecord.getQual).fold(builder.setBaseQualityString("*"))(s => builder.setBaseQualityString(s))


Simple pattern matching here would prevent double-setting the qualityString (unless I'm misreading this) and prevent allocating objects we won't use.
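One way the suggested pattern match might look, as a sketch only. `FakeSamBuilder` is a stand-in defined here so the example is self-contained; it is not the real SAM record class, and `setQuality` is a hypothetical helper name. Matching on the value directly sets the quality string exactly once and avoids wrapping it in an `Option`.

```scala
object QualityToSam {
  // Stand-in for the SAM record builder; only the one setter used here.
  final class FakeSamBuilder {
    var baseQualityString: String = _
    def setBaseQualityString(q: String): Unit = { baseQualityString = q }
  }

  // Write the quality string back out, restoring "*" when the stored value is null.
  def setQuality(builder: FakeSamBuilder, qual: String): Unit = qual match {
    case null => builder.setBaseQualityString("*")
    case q    => builder.setBaseQualityString(q)
  }
}
```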

massie · 2015-04-09T18:30:19Z

Other than one nit, this looks good to me.

AmplabJenkins · 2015-04-09T18:41:42Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/672/
Test PASSed.

[ADAM-646] Special case reads with '*' quality during BQSR.

massie · 2015-04-09T20:37:51Z

Thanks, Frank!

Jaeki · 2015-04-10T02:04:34Z

Thanks, Frank!
I have tested it on my cluster with a small data set, and it works well.

fnothaft · 2015-04-10T02:05:01Z

Great! Glad to hear it @Jaeki!

Jaeki · 2015-04-11T04:23:52Z

@fnothaft Could you check the following case?
I got a similar error when running BQSR. The input SAM file was aligned with SNAP (NA12878).

15/04/11 12:35:38 INFO DAGScheduler: Job 1 failed: aggregate at BaseQualityRecalibration.scala:84, took 3.648340 s
Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 45 in stage 1.0 failed 4 times, most recent failure: Lost task 45.3 in stage 1.0 (TID 111, node-120): java.lang.IllegalArgumentException: Error "requirement failed" while constructing DecadentRead from Read({"contig": {"contigName": "chrY", "contigLength": 59373566, "contigMD5": null, "referenceURL": null, "assembly": null, "species": null}, "start": 13833306, "oldPosition": null, "end": 13833411, "mapq": 23, "readName": "ERR032977_24245808", "sequence": "AAATGGAACGAAGTGGAATCGAGTGGAATGGAATCGAATGGAGTGAAATGGAATGGAATGGACGCGAAAGAATGGACTGGAACAAAATGAAATCGAACGGT", "qual": "CCCCCCCCCCCCCCCCCBCCCCCDCCCCCCCDCCCC@DCCCBCBCBBBCCCABCCCBDBCCDCCBCABBC?@@A@BABBBDBD@D<8;BB8?:@@d@B>>1", "cigar": "69M1D35=", "oldCigar": null, "basesTrimmedFromStart": 0, "basesTrimmedFromEnd": 0, "readPaired": true, "properPair": true, "readMapped": true, "mateMapped": true, "firstOfPair": false, "secondOfPair": true, "failedVendorQualityChecks": false, "duplicateRead": false, "readNegativeStrand": false, "mateNegativeStrand": true, "primaryAlignment": true, "secondaryAlignment": false, "supplementaryAlignment": false, "mismatchingPositions": null, "origQual": null, "attributes": "PU:Z:pu\tSM:Z:sm\tNM:i:14\tPL:Z:Illumina\tRG:Z:FASTQ\tPG:Z:SNAP\tLB:Z:lb", "recordGroupName": "FASTQ", "recordGroupSequencingCenter": null, "recordGroupDescription": null, "recordGroupRunDateEpoch": null, "recordGroupFlowOrder": null, "recordGroupKeySequence": null, "recordGroupLibrary": "lb", "recordGroupPredictedMedianInsertSize": null, "recordGroupPlatform": "Illumina", "recordGroupPlatformUnit": "pu", "recordGroupSample": "sm", "mateAlignmentStart": 13869382, "mateAlignmentEnd": null, "mateContig": {"contigName": "chrY", "contigLength": 59373566, "contigMD5": null, "referenceURL": null, "assembly": null, "species": null}})
at org.bdgenomics.adam.rich.DecadentRead$.apply(DecadentRead.scala:40)
at org.bdgenomics.adam.rich.DecadentRead$.apply(DecadentRead.scala:32)
at org.bdgenomics.adam.rich.DecadentRead$$anonfun$cloy$1.apply(DecadentRead.scala:50)
at org.bdgenomics.adam.rich.DecadentRead$$anonfun$cloy$1.apply(DecadentRead.scala:50)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:389)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at scala.collection.TraversableOnce$class.foldLeft(TraversableOnce.scala:144)
at scala.collection.AbstractIterator.foldLeft(Iterator.scala:1157)
at scala.collection.TraversableOnce$class.aggregate(TraversableOnce.scala:201)
at scala.collection.AbstractIterator.aggregate(Iterator.scala:1157)
at org.apache.spark.rdd.RDD$$anonfun$22.apply(RDD.scala:901)
at org.apache.spark.rdd.RDD$$anonfun$22.apply(RDD.scala:901)
at org.apache.spark.SparkContext$$anonfun$29.apply(SparkContext.scala:1355)
at org.apache.spark.SparkContext$$anonfun$29.apply(SparkContext.scala:1355)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
at org.apache.spark.scheduler.Task.run(Task.scala:56)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.IllegalArgumentException: requirement failed
at scala.Predef$.require(Predef.scala:221)
at org.bdgenomics.adam.rich.DecadentRead.<init>(DecadentRead.scala:71)
at org.bdgenomics.adam.rich.DecadentRead$.apply(DecadentRead.scala:36)
... 23 more

massie · 2015-04-11T06:34:01Z

The exact line of the error is shown in this stack trace.

at org.bdgenomics.adam.rich.DecadentRead.<init>(DecadentRead.scala:71)

Your sequence is 101 bases long (wc -c reports 102 because it also counts the newline that echo appends)...

$ echo "AAATGGAACGAAGTGGAATCGAGTGGAATGGAATCGAATGGAGTGAAATGGAATGGAATGGACGCGAAAGAATGGACTGGAACAAAATGAAATCGAACGGT" | wc -c
102

... but the difference between your reference start and end positions is 13833411 - 13833306 = 105, which agrees with the reference span of the cigar string 69M1D35= (69 + 1 + 35 = 105). The deletion consumes no read bases, so that cigar implies a read of 69 + 35 = 104 bases.

The sequence is missing 3 bases.
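The length check above can be sketched in code. Under the SAM convention, the operators M, I, S, = and X consume read bases, while M, D, N, = and X consume reference positions. `CigarLengths` and `lengths` are hypothetical names for illustration, not ADAM functions.

```scala
object CigarLengths {
  // One CIGAR element: a run length followed by a single operator character.
  private val Element = """(\d+)([MIDNSHP=X])""".r

  // Operators that consume read bases / reference positions (SAM convention).
  private val ConsumesRead = Set('M', 'I', 'S', '=', 'X')
  private val ConsumesReference = Set('M', 'D', 'N', '=', 'X')

  // Returns (read bases consumed, reference positions spanned).
  def lengths(cigar: String): (Int, Int) =
    Element.findAllMatchIn(cigar).foldLeft((0, 0)) {
      case ((read, ref), m) =>
        val len = m.group(1).toInt
        val op = m.group(2).head
        (if (ConsumesRead(op)) read + len else read,
         if (ConsumesReference(op)) ref + len else ref)
    }
}
```

For the read in the stack trace, `CigarLengths.lengths("69M1D35=")` gives 104 read bases and 105 reference positions, so a stored sequence shorter than 104 bases does not match its cigar.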

Jaeki · 2015-04-13T01:19:36Z

@massie Thank you for your comment. Do you think the missing 3 bases come from the SNAP tool? I downloaded the NA12878 reads from the web, aligned them with SNAP, transformed the SAM file to ADAM, and then ran BQSR with ADAM. That's what I did.

fnothaft force-pushed the allow-asterisk-bqsr branch from f6ce721 to 608d5f7 Compare April 9, 2015 18:00

massie reviewed Apr 9, 2015
fnothaft added 2 commits April 9, 2015 11:32

Null * quality strings on conversion.

7a3876a

[ADAM-646] Special case reads with '*' quality during BQSR.

4e30187

fnothaft force-pushed the allow-asterisk-bqsr branch from 608d5f7 to 4e30187 Compare April 9, 2015 18:32

massie added a commit that referenced this pull request Apr 9, 2015

Merge pull request #647 from fnothaft/allow-asterisk-bqsr

4c615da

[ADAM-646] Special case reads with '*' quality during BQSR.

massie merged commit 4c615da into bigdatagenomics:master Apr 9, 2015

fnothaft deleted the allow-asterisk-bqsr branch April 10, 2015 02:05
