Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

export valid fastq #856

Closed
wants to merge 5 commits into from
Closed

Conversation

allenday
Copy link
Contributor

@allenday allenday commented Oct 8, 2015

The adam2fastq exported in the case of a sam with "" (unknown read quality) just exports the "". This is invalid fastq.

This PR addresses the issue by assigning quality value of "B" for each base - this is conventionally used for reads of unknown quality.

Additionally, reads loaded in from fastq are also assigned "B"-string for malformed records containing "*" for read quality.

…ecord is null or "*". This gracefully fixes these records on import
…ecord is null or "*". This gracefully fixes these records on import
@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins/job/ADAM-prb/987/

Build result: FAILURE

GitHub pull request #856 of commit a380305 automatically merged.Notifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'[EnvInject] - Loading node environment variables.Building remotely on amp-jenkins-worker-05 (centos spark-test) in workspace /home/jenkins/workspace/ADAM-prb > git rev-parse --is-inside-work-tree # timeout=10Fetching changes from the remote Git repository > git config remote.origin.url https://github.com/bigdatagenomics/adam.git # timeout=10Fetching upstream changes from https://github.com/bigdatagenomics/adam.git > git --version # timeout=10 > git fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/pull/:refs/remotes/origin/pr/ > git rev-parse origin/pr/856/merge^{commit} # timeout=10 > git branch -a --contains c882546 # timeout=10 > git rev-parse remotes/origin/pr/856/merge^{commit} # timeout=10Checking out Revision c882546 (origin/pr/856/merge) > git config core.sparsecheckout # timeout=10 > git checkout -f c8825467fdc55ad18cebac641804db2749da1011First time build. Skipping changelog.Triggering ADAM-prb ? 2.6.0,2.11,1.4.1,centosTriggering ADAM-prb ? 2.6.0,2.10,1.4.1,centosTouchstone configurations resulted in FAILURE, so aborting...Notifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'
Test FAILed.

@heuermh
Copy link
Member

heuermh commented Oct 8, 2015

@allenday Jenkins is reporting a test failure but only with Hadoop 2.6.0 and Spark 1.4.1

org.bdgenomics.adam.rdd.read.AlignmentRecordRDDFunctionsSuite.convert malformed FASTQ (no quality scores) => SAM => well-formed FASTQ => SAM

Error Message

Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0, localhost): java.lang.Exception: Fastq quality must be defined
 at org.bdgenomics.adam.converters.FastqRecordConverter.convertRead(FastqRecordConverter.scala:150)
 at org.bdgenomics.adam.rdd.ADAMContext$$anonfun$loadUnpairedFastq$1.apply(ADAMContext.scala:424)
 at org.bdgenomics.adam.rdd.ADAMContext$$anonfun$loadUnpairedFastq$1.apply(ADAMContext.scala:424)
 at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
 at scala.collection.Iterator$class.foreach(Iterator.scala:727)
 at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
 at scala.collection.TraversableOnce$class.foldLeft(TraversableOnce.scala:144)
 at scala.collection.AbstractIterator.foldLeft(Iterator.scala:1157)
 at org.bdgenomics.adam.rdd.ADAMSequenceDictionaryRDDAggregator.org$bdgenomics$adam$rdd$ADAMSequenceDictionaryRDDAggregator$$foldIterator$1(ADAMRDDFunctions.scala:120)
 at org.bdgenomics.adam.rdd.ADAMSequenceDictionaryRDDAggregator$$anonfun$3.apply(ADAMRDDFunctions.scala:126)
 at org.bdgenomics.adam.rdd.ADAMSequenceDictionaryRDDAggregator$$anonfun$3.apply(ADAMRDDFunctions.scala:126)
 at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$17.apply(RDD.scala:686)
 at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$17.apply(RDD.scala:686)
 at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
 at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:63)
 at org.apache.spark.scheduler.Task.run(Task.scala:70)
 at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
 at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
 at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 at java.lang.Thread.run(Thread.java:745)

Driver stacktrace:

Stacktrace

      org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0, localhost): java.lang.Exception: Fastq quality must be defined
    at org.bdgenomics.adam.converters.FastqRecordConverter.convertRead(FastqRecordConverter.scala:150)
    at org.bdgenomics.adam.rdd.ADAMContext$$anonfun$loadUnpairedFastq$1.apply(ADAMContext.scala:424)
    at org.bdgenomics.adam.rdd.ADAMContext$$anonfun$loadUnpairedFastq$1.apply(ADAMContext.scala:424)
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
    at scala.collection.Iterator$class.foreach(Iterator.scala:727)
    at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
    at scala.collection.TraversableOnce$class.foldLeft(TraversableOnce.scala:144)
    at scala.collection.AbstractIterator.foldLeft(Iterator.scala:1157)
    at org.bdgenomics.adam.rdd.ADAMSequenceDictionaryRDDAggregator.org$bdgenomics$adam$rdd$ADAMSequenceDictionaryRDDAggregator$$foldIterator$1(ADAMRDDFunctions.scala:120)
    at org.bdgenomics.adam.rdd.ADAMSequenceDictionaryRDDAggregator$$anonfun$3.apply(ADAMRDDFunctions.scala:126)
    at org.bdgenomics.adam.rdd.ADAMSequenceDictionaryRDDAggregator$$anonfun$3.apply(ADAMRDDFunctions.scala:126)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$17.apply(RDD.scala:686)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$17.apply(RDD.scala:686)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:63)
    at org.apache.spark.scheduler.Task.run(Task.scala:70)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)

Driver stacktrace:
      at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1273)
      at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1264)
      at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1263)
      at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
      at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
      at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1263)
      at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:730)
      at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:730)
      at scala.Option.foreach(Option.scala:236)
      at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:730)
      at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1457)
      at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1418)
      at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)

@fnothaft
Copy link
Member

fnothaft commented Oct 8, 2015

Jenkins, retest this please.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins/job/ADAM-prb/990/

Build result: FAILURE

GitHub pull request #856 of commit a380305 automatically merged.Notifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'[EnvInject] - Loading node environment variables.Building remotely on amp-jenkins-worker-05 (centos spark-test) in workspace /home/jenkins/workspace/ADAM-prb > git rev-parse --is-inside-work-tree # timeout=10Fetching changes from the remote Git repository > git config remote.origin.url https://github.com/bigdatagenomics/adam.git # timeout=10Fetching upstream changes from https://github.com/bigdatagenomics/adam.git > git --version # timeout=10 > git fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/pull/:refs/remotes/origin/pr/ > git rev-parse origin/pr/856/merge^{commit} # timeout=10 > git branch -a --contains c882546 # timeout=10 > git rev-parse remotes/origin/pr/856/merge^{commit} # timeout=10Checking out Revision c882546 (origin/pr/856/merge) > git config core.sparsecheckout # timeout=10 > git checkout -f c8825467fdc55ad18cebac641804db2749da1011First time build. Skipping changelog.Triggering ADAM-prb ? 2.6.0,2.11,1.4.1,centosTriggering ADAM-prb ? 2.6.0,2.10,1.4.1,centosTouchstone configurations resulted in FAILURE, so aborting...Notifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'
Test FAILed.

@allenday
Copy link
Contributor Author

allenday commented Oct 9, 2015

Yes, I see it. I'll investigate this coming week. Should be minor, all my
tests pass locally.

Sent with MailTrack
https://mailtrack.io/install?source=signature&lang=en&referral=allenday@allenday.com&idSignature=22

On Fri, Oct 9, 2015 at 3:45 AM, Frank Austin Nothaft <
notifications@github.com> wrote:

Jenkins, retest this please.


Reply to this email directly or view it on GitHub
#856 (comment).

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins/job/ADAM-prb/998/
Test PASSed.

@allenday
Copy link
Contributor Author

@fnothaft test passing now

else
adamRecord.getOrigQual
else if (adamRecord.getQual == null)
"B" * seqLength
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The wikipedia link above is dead, and it reads to me like B as unknown quality score is not universal across FASTQ records produced by different versions of Illumina software. Should this be a parameter presented to the user?

@fnothaft
Copy link
Member

Thanks @allenday! I've merged this as 54d5095. Let's continue the discussion on #818.

@fnothaft fnothaft closed this Nov 10, 2015
@heuermh heuermh added this to the 0.18.2 milestone Nov 10, 2015
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants