[ADAM-646] Special case reads with '*' quality during BQSR. #647

Merged
merged 2 commits into from
Apr 9, 2015

fnothaft
Member

@fnothaft fnothaft commented Apr 9, 2015

Resolves #646. Allows the creation of DecadentReads with * quality scores. These reads are then not observed or corrected during BQSR.

@massie
Member

massie commented Apr 9, 2015

This looks good, but should we just set qualityString to null if it's * and then just do a null check?

@fnothaft
Member Author

fnothaft commented Apr 9, 2015

Are you suggesting to do that check in the SAM/BAM<->ADAM converters?

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/670/

Build result: FAILURE

GitHub pull request #647 of commit f6ce721 automatically merged.
Notifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'
[EnvInject] - Loading node environment variables.
Building remotely on amp-jenkins-worker-05 (centos) in workspace /home/jenkins/workspace/ADAM-prb
 > git rev-parse --is-inside-work-tree # timeout=10
Fetching changes from the remote Git repository
 > git config remote.origin.url https://github.com/bigdatagenomics/adam.git # timeout=10
Fetching upstream changes from https://github.com/bigdatagenomics/adam.git
 > git --version # timeout=10
 > git fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/pull/:refs/remotes/origin/pr/
 > git rev-parse origin/pr/647/merge^{commit} # timeout=10
 > git branch -a --contains d7e55c115cfc9f4de7289144d2506ea006bf3237 # timeout=10
 > git rev-parse remotes/origin/pr/647/merge^{commit} # timeout=10
Checking out Revision d7e55c115cfc9f4de7289144d2506ea006bf3237 (origin/pr/647/merge)
 > git config core.sparsecheckout # timeout=10
 > git checkout -f d7e55c115cfc9f4de7289144d2506ea006bf3237
First time build. Skipping changelog.
Triggering ADAM-prb » 2.2.0,centos
Triggering ADAM-prb » 2.3.0,centos
Triggering ADAM-prb » 1.0.4,centos
ADAM-prb » 2.2.0,centos completed with result SUCCESS
ADAM-prb » 2.3.0,centos completed with result FAILURE
ADAM-prb » 1.0.4,centos completed with result SUCCESS
Notifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'

@massie
Member

massie commented Apr 9, 2015

Yes, when we convert from BAM to ADAM, simply set the qualityString to null if it's *. It will be more compact (we save two bytes for each read) and doesn't require any string comparisons (albeit the string isn't very long :)).
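
A minimal Scala sketch of the suggestion above: treat * as "no qualities" at conversion time and use a presence check afterwards. The object and method names (QualityConversion, fromSam, shouldRecalibrate) are hypothetical stand-ins for illustration, not ADAM's converter or BQSR code.

```scala
// Hypothetical stand-in for the quality handling discussed above; not ADAM's converter code.
object QualityConversion {
  // SAM uses "*" to mean that no base qualities are stored for the read.
  val MissingQual = "*"

  // SAM/BAM -> ADAM direction: map "*" (or an absent value) to None,
  // which would be stored as null in the record.
  def fromSam(samQual: String): Option[String] =
    Option(samQual).filter(_ != MissingQual)

  // Downstream check: only reads that carry qualities are observed or corrected.
  def shouldRecalibrate(qual: Option[String]): Boolean = qual.isDefined
}
```

With this shape, the string comparison happens once during conversion, and later stages only need the presence check.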

@fnothaft
Member Author

fnothaft commented Apr 9, 2015

That's a good idea. Let me refactor that.

@fnothaft
Member Author

fnothaft commented Apr 9, 2015

Updated with the null on conversion change.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/671/

Build result: FAILURE

GitHub pull request #647 of commit 608d5f7 automatically merged.
Notifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'
[EnvInject] - Loading node environment variables.
Building remotely on amp-jenkins-worker-05 (centos) in workspace /home/jenkins/workspace/ADAM-prb
 > git rev-parse --is-inside-work-tree # timeout=10
Fetching changes from the remote Git repository
 > git config remote.origin.url https://github.com/bigdatagenomics/adam.git # timeout=10
Fetching upstream changes from https://github.com/bigdatagenomics/adam.git
 > git --version # timeout=10
 > git fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/pull/:refs/remotes/origin/pr/
 > git rev-parse origin/pr/647/merge^{commit} # timeout=10
 > git branch -a --contains 356a6d6711a5a558e7df6c94cd0b427d82a69d58 # timeout=10
 > git rev-parse remotes/origin/pr/647/merge^{commit} # timeout=10
Checking out Revision 356a6d6711a5a558e7df6c94cd0b427d82a69d58 (origin/pr/647/merge)
 > git config core.sparsecheckout # timeout=10
 > git checkout -f 356a6d6711a5a558e7df6c94cd0b427d82a69d58
First time build. Skipping changelog.
Triggering ADAM-prb » 2.2.0,centos
Triggering ADAM-prb » 2.3.0,centos
Triggering ADAM-prb » 1.0.4,centos
ADAM-prb » 2.2.0,centos completed with result SUCCESS
ADAM-prb » 2.3.0,centos completed with result FAILURE
ADAM-prb » 1.0.4,centos completed with result SUCCESS
Notifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'

@fnothaft
Member Author

fnothaft commented Apr 9, 2015

Jenkins, retest this please.

Looks like some issue with a JAR not being pulled down in the Hadoop 2.2 build.

@@ -79,7 +79,7 @@ class AlignmentRecordConverter extends Serializable {
 // set canonically necessary fields
 builder.setReadName(adamRecord.getReadName.toString)
 builder.setReadString(adamRecord.getSequence)
-builder.setBaseQualityString(adamRecord.getQual)
+Option(adamRecord.getQual).fold(builder.setBaseQualityString("*"))(s => builder.setBaseQualityString(s))
Member

Simple pattern matching here would prevent double-setting the qualityString (unless I'm misreading this) and prevent allocating objects we won't use.

Member Author

Ah, sure.

Member Author

Fixed.
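
The pattern-matching form suggested in this review thread could look roughly like the sketch below. RecordBuilder is a hypothetical minimal stand-in for the SAM record builder in the diff, and writeQual is an illustrative name; the code actually merged may differ. Note that Scala's Option.fold takes its empty-case argument by name, so the fold version should also set the value only once; the match simply avoids allocating the Option and the closure.

```scala
// Hypothetical minimal builder standing in for the SAM record builder in the diff.
final class RecordBuilder {
  var baseQualityString: String = null
  var setCalls: Int = 0 // counts how often the setter ran, for illustration only

  def setBaseQualityString(q: String): Unit = {
    baseQualityString = q
    setCalls += 1
  }
}

object QualityWriteBack {
  // ADAM -> SAM direction: a null quality string is written out as "*".
  def writeQual(builder: RecordBuilder, qual: String): Unit = qual match {
    case null => builder.setBaseQualityString("*")
    case s    => builder.setBaseQualityString(s)
  }
}
```

Either branch calls the setter exactly once, which is the property the review comment asks for.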

@massie
Member

massie commented Apr 9, 2015

Other than one nit, this looks good to me.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/672/

massie added a commit that referenced this pull request Apr 9, 2015
[ADAM-646] Special case reads with '*' quality during BQSR.
@massie massie merged commit 4c615da into bigdatagenomics:master Apr 9, 2015
@massie
Member

massie commented Apr 9, 2015

Thanks, Frank!

@Jaeki

Jaeki commented Apr 10, 2015

Thanks, Frank!
I tested it on my cluster with a small dataset, and it works well.

@fnothaft
Member Author

Great! Glad to hear it @Jaeki!

@fnothaft fnothaft deleted the allow-asterisk-bqsr branch April 10, 2015 02:05
@Jaeki

Jaeki commented Apr 11, 2015

@fnothaft Could you check the following case?
I got a similar error when running BQSR. The input SAM file was aligned with SNAP (NA12878).

15/04/11 12:35:38 INFO DAGScheduler: Job 1 failed: aggregate at BaseQualityRecalibration.scala:84, took 3.648340 s
Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 45 in stage 1.0 failed 4 times, most recent failure: Lost task 45.3 in stage 1.0 (TID 111, node-120): java.lang.IllegalArgumentException: Error "requirement failed" while constructing DecadentRead from Read({"contig": {"contigName": "chrY", "contigLength": 59373566, "contigMD5": null, "referenceURL": null, "assembly": null, "species": null}, "start": 13833306, "oldPosition": null, "end": 13833411, "mapq": 23, "readName": "ERR032977_24245808", "sequence": "AAATGGAACGAAGTGGAATCGAGTGGAATGGAATCGAATGGAGTGAAATGGAATGGAATGGACGCGAAAGAATGGACTGGAACAAAATGAAATCGAACGGT", "qual": "CCCCCCCCCCCCCCCCCBCCCCCDCCCCCCCDCCCC@DCCCBCBCBBBCCCABCCCBDBCCDCCBCABBC?@@A@BABBBDBD@D<8;BB8?:@@d@B>>1", "cigar": "69M1D35=", "oldCigar": null, "basesTrimmedFromStart": 0, "basesTrimmedFromEnd": 0, "readPaired": true, "properPair": true, "readMapped": true, "mateMapped": true, "firstOfPair": false, "secondOfPair": true, "failedVendorQualityChecks": false, "duplicateRead": false, "readNegativeStrand": false, "mateNegativeStrand": true, "primaryAlignment": true, "secondaryAlignment": false, "supplementaryAlignment": false, "mismatchingPositions": null, "origQual": null, "attributes": "PU:Z:pu\tSM:Z:sm\tNM:i:14\tPL:Z:Illumina\tRG:Z:FASTQ\tPG:Z:SNAP\tLB:Z:lb", "recordGroupName": "FASTQ", "recordGroupSequencingCenter": null, "recordGroupDescription": null, "recordGroupRunDateEpoch": null, "recordGroupFlowOrder": null, "recordGroupKeySequence": null, "recordGroupLibrary": "lb", "recordGroupPredictedMedianInsertSize": null, "recordGroupPlatform": "Illumina", "recordGroupPlatformUnit": "pu", "recordGroupSample": "sm", "mateAlignmentStart": 13869382, "mateAlignmentEnd": null, "mateContig": {"contigName": "chrY", "contigLength": 59373566, "contigMD5": null, "referenceURL": null, "assembly": null, "species": null}})
at org.bdgenomics.adam.rich.DecadentRead$.apply(DecadentRead.scala:40)
at org.bdgenomics.adam.rich.DecadentRead$.apply(DecadentRead.scala:32)
at org.bdgenomics.adam.rich.DecadentRead$$anonfun$cloy$1.apply(DecadentRead.scala:50)
at org.bdgenomics.adam.rich.DecadentRead$$anonfun$cloy$1.apply(DecadentRead.scala:50)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:389)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at scala.collection.TraversableOnce$class.foldLeft(TraversableOnce.scala:144)
at scala.collection.AbstractIterator.foldLeft(Iterator.scala:1157)
at scala.collection.TraversableOnce$class.aggregate(TraversableOnce.scala:201)
at scala.collection.AbstractIterator.aggregate(Iterator.scala:1157)
at org.apache.spark.rdd.RDD$$anonfun$22.apply(RDD.scala:901)
at org.apache.spark.rdd.RDD$$anonfun$22.apply(RDD.scala:901)
at org.apache.spark.SparkContext$$anonfun$29.apply(SparkContext.scala:1355)
at org.apache.spark.SparkContext$$anonfun$29.apply(SparkContext.scala:1355)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
at org.apache.spark.scheduler.Task.run(Task.scala:56)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.IllegalArgumentException: requirement failed
at scala.Predef$.require(Predef.scala:221)
at org.bdgenomics.adam.rich.DecadentRead.<init>(DecadentRead.scala:71)
at org.bdgenomics.adam.rich.DecadentRead$.apply(DecadentRead.scala:36)
... 23 more

@massie
Member

massie commented Apr 11, 2015

The exact line of the error is shown in this stack trace.

at org.bdgenomics.adam.rich.DecadentRead.<init>(DecadentRead.scala:71)

Your sequence is 101 bases long (wc -c reports 102 because it also counts the trailing newline that echo adds)...

$ echo "AAATGGAACGAAGTGGAATCGAGTGGAATGGAATCGAATGGAGTGAAATGGAATGGAATGGACGCGAAAGAATGGACTGGAACAAAATGAAATCGAACGGT" | wc -c
102

... but the difference between your reference start and end position is 13833411 - 13833306 = 105. The cigar string is 69M1D35=, which agrees with that reference span: 69+1+35 = 105. A deletion consumes no read bases, so the cigar implies a read of 69+35 = 104 bases.

The sequence is missing 3 bases.
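
A minimal Scala sketch of the length check discussed above, assuming the SAM specification's rule for which CIGAR operators consume read bases and which consume reference bases. The object and method names are hypothetical, not ADAM's API. For 69M1D35= it gives a reference span of 105 and an expected read length of 104, since a deletion consumes no read bases; the sequence in the trace has 101 bases (wc -c also counts the newline from echo), which is 3 short of that.

```scala
// Hypothetical helper for checking a read against its CIGAR; not part of ADAM.
object CigarLengths {
  // One CIGAR element: a run length followed by a single operator character.
  private val Element = """(\d+)([MIDNSHP=X])""".r

  // Parse a CIGAR string such as "69M1D35=" into (length, operator) pairs.
  def elements(cigar: String): List[(Int, Char)] =
    Element.findAllMatchIn(cigar).map(m => (m.group(1).toInt, m.group(2).head)).toList

  // Operators that consume read (query) bases, per the SAM specification.
  private val consumesRead = Set('M', 'I', 'S', '=', 'X')
  // Operators that consume reference bases, per the SAM specification.
  private val consumesReference = Set('M', 'D', 'N', '=', 'X')

  // Number of bases the read sequence should contain.
  def readLength(cigar: String): Int =
    elements(cigar).collect { case (n, op) if consumesRead(op) => n }.sum

  // Number of reference positions the alignment spans.
  def referenceLength(cigar: String): Int =
    elements(cigar).collect { case (n, op) if consumesReference(op) => n }.sum
}
```

Comparing CigarLengths.readLength(cigar) with the length of the sequence string before running BQSR would flag records like this one.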

@Jaeki

Jaeki commented Apr 13, 2015

@massie Thank you for your comment. Do you think the missing 3 bases come from the SNAP tool? I downloaded the NA12878 reads from the web, aligned them with SNAP, converted the SAM file to ADAM, and then ran BQSR with ADAM. That's what I did.

Successfully merging this pull request may close these issues.

Requirement failed warning while running BQSR @DecadentRead
4 participants