
Test demonstrating region join failure #1206

Closed

Conversation

@jpdna jpdna (Member) commented Oct 12, 2016

While trying to apply shuffleRegionJoin to a gVCF use case, I hit an OOM error running an InnerShuffleRegionJoin on a tiny dataset of 3 variants joined against 2 variants.

In this PR I attempt to recreate the same set of ReferenceRegion intervals in two RDDs to join, to demonstrate the problem; I use AlignmentRecords here because I modified an existing test.

It's possible that something else is now wrong with this test, because I see an ArrayIndexOutOfBoundsException rather than an OOM, but in any case the test below fails for a reason I don't understand, and not gracefully. If we can make this new test work with these intervals, it will at least be a step toward figuring out why my gVCF join with the same intervals is failing.

- Test join that was failing 10/11/2016 *** FAILED ***
  org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 0.0 failed 1 times, most recent failure: Lost task 1.0 in stage 0.0 (TID 1, localhost): java.lang.ArrayIndexOutOfBoundsException: 5385867
    at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:151)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
    at org.apache.spark.scheduler.Task.run(Task.scala:89)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)

Driver stacktrace:
  at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1431)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1419)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1418)
  at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
  at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
  at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1418)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:799)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:799)
  at scala.Option.foreach(Option.scala:236)
  at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:799)
  at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1640)
  at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1599)
  at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scal

@AmplabJenkins commented
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1532/

Build result: FAILURE

GitHub pull request #1206 of commit f3cdb50 automatically merged.
Notifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'
[EnvInject] - Loading node environment variables.
Building remotely on amp-jenkins-worker-05 (centos spark-test) in workspace /home/jenkins/workspace/ADAM-prb
 > /home/jenkins/git2/bin/git rev-parse --is-inside-work-tree # timeout=10
Fetching changes from the remote Git repository
 > /home/jenkins/git2/bin/git config remote.origin.url https://github.com/bigdatagenomics/adam.git # timeout=10
Fetching upstream changes from https://github.com/bigdatagenomics/adam.git
 > /home/jenkins/git2/bin/git --version # timeout=10
 > /home/jenkins/git2/bin/git -c core.askpass=true fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/pull/:refs/remotes/origin/pr/ # timeout=15
 > /home/jenkins/git2/bin/git rev-parse origin/pr/1206/merge^{commit} # timeout=10
 > /home/jenkins/git2/bin/git branch -a --contains c7b6acb # timeout=10
 > /home/jenkins/git2/bin/git rev-parse remotes/origin/pr/1206/merge^{commit} # timeout=10
Checking out Revision c7b6acb (origin/pr/1206/merge)
 > /home/jenkins/git2/bin/git config core.sparsecheckout # timeout=10
 > /home/jenkins/git2/bin/git checkout -f c7b6acbc103a8ae8f7e8c28638cd312b6a22190a
First time build. Skipping changelog.
Triggering ADAM-prb ? 2.6.0,2.11,1.5.2,centos
Triggering ADAM-prb ? 2.6.0,2.10,1.5.2,centos
Touchstone configurations resulted in FAILURE, so aborting...
Notifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'

@fnothaft (Member) commented

OK, cool! By any chance, is the data on the cluster? If so, perhaps let's start a thread and we can work offline to debug it there as well.

@jpdna jpdna (Member, Author) commented Oct 12, 2016

Not on the cluster, but I provide a script so you can try to repro the gVCF join I was attempting, in this gist I just made:
https://gist.github.com/jpdna/a352ab9304a1885d01d3ac1c65dc77a8

which has links to the tiny input files here:
https://drive.google.com/drive/folders/0B6jh69UgixwpdDlGUkhRaW42QzA?usp=sharing

You want to start an email thread to discuss offline?

@fnothaft (Member) commented

> You want to start an email thread to discuss offline?

If the files are public then we don't need an offline thread; I'd assumed they weren't public. I'll take a look tomorrow.

@jpdna jpdna (Member, Author) commented Oct 12, 2016

I think I found the solution. It seems that the partition size needs to be more like 5000000, since it is in fact the bin size in nucleotides.

val result1 = InnerShuffleRegionJoin[VariantContext, VariantContext](x.sequences, 5000000, sc).partitionAndJoin(x_with_key, y_with_key)

now works for me and seems to give the join result I expected.

I started with a much lower number because the partition size used in the existing test at
https://github.com/bigdatagenomics/adam/blob/master/adam-core/src/test/scala/org/bdgenomics/adam/rdd/InnerShuffleRegionJoinSuite.scala#L26
is 3.

But for working with a whole chromosome or genome, a partition size of a million or more seems to make sense.
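As a rough back-of-the-envelope check (plain Python, not ADAM code, and assuming a simplified model where the join allocates one bin per partition-size window of each contig):

```python
# Hypothetical bin-count arithmetic: assumes the shuffle join covers each
# contig with ceil(contig_length / partition_size) bins. This is a sketch
# of the scaling argument, not ADAM's actual implementation.

def num_bins(contig_length, partition_size):
    """Number of partition_size-sized bins needed to cover a contig."""
    return -(-contig_length // partition_size)  # ceiling division

chr1 = 249_250_621  # approximate length of human chr1 (GRCh37)

# With the test suite's partition size of 3, the bin count explodes:
print(num_bins(chr1, 3))          # tens of millions of bins for one contig
# With 5,000,000 nucleotides per bin it stays tiny:
print(num_bins(chr1, 5_000_000))  # a few dozen bins
```

So a tiny partition size that works for a toy contig in a unit test blows up into tens of millions of bins on a real chromosome, which would explain running out of memory on only a handful of variants.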

I'll close this PR shortly.

@fnothaft (Member) commented

Oh, nice! Good catch. Perhaps we can beef up the documentation?

@jpdna jpdna (Member, Author) commented Oct 12, 2016

> Perhaps we can beef up the documentation?

Yup, a tiny PR for that doc is just in, thanks!

@jpdna jpdna closed this Oct 12, 2016