
Test demonstrating region join failure #1206

Closed

Conversation

@jpdna jpdna (Member) commented Oct 12, 2016

While trying to apply shuffleRegionJoin to a gVCF use case, I hit an OOM error running an InnerShuffleRegionJoin on a tiny dataset of 3 variants joined against 2 variants.

In this PR I attempt to recreate the same set of ReferenceRegion intervals in two RDDs to join, to demonstrate the problem; I use AlignmentRecords here because I modified an existing test.

It's possible that something else is now wrong with this test, because I see an ArrayIndexOutOfBoundsException rather than an OOM, but in any case the test below fails for a reason I don't understand, and not gracefully. If we can make this new test work with these intervals, it will at least be a step toward figuring out why my gVCF join with the same intervals is failing.

- Test join that was failing 10/11/2016 *** FAILED ***
  org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 0.0 failed 1 times, most recent failure: Lost task 1.0 in stage 0.0 (TID 1, localhost): java.lang.ArrayIndexOutOfBoundsException: 5385867
    at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:151)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
    at org.apache.spark.scheduler.Task.run(Task.scala:89)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)

Driver stacktrace:
  at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1431)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1419)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1418)
  at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
  at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
  at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1418)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:799)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:799)
  at scala.Option.foreach(Option.scala:236)
  at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:799)
  at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1640)
  at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1599)
  at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scal

@AmplabJenkins commented
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1532/

Build result: FAILURE

GitHub pull request #1206 of commit f3cdb50 automatically merged.
Notifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'
[EnvInject] - Loading node environment variables.
Building remotely on amp-jenkins-worker-05 (centos spark-test) in workspace /home/jenkins/workspace/ADAM-prb
 > /home/jenkins/git2/bin/git rev-parse --is-inside-work-tree # timeout=10
Fetching changes from the remote Git repository
 > /home/jenkins/git2/bin/git config remote.origin.url https://github.com/bigdatagenomics/adam.git # timeout=10
Fetching upstream changes from https://github.com/bigdatagenomics/adam.git
 > /home/jenkins/git2/bin/git --version # timeout=10
 > /home/jenkins/git2/bin/git -c core.askpass=true fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/pull/:refs/remotes/origin/pr/ # timeout=15
 > /home/jenkins/git2/bin/git rev-parse origin/pr/1206/merge^{commit} # timeout=10
 > /home/jenkins/git2/bin/git branch -a --contains c7b6acb # timeout=10
 > /home/jenkins/git2/bin/git rev-parse remotes/origin/pr/1206/merge^{commit} # timeout=10
Checking out Revision c7b6acb (origin/pr/1206/merge)
 > /home/jenkins/git2/bin/git config core.sparsecheckout # timeout=10
 > /home/jenkins/git2/bin/git checkout -f c7b6acbc103a8ae8f7e8c28638cd312b6a22190a
First time build. Skipping changelog.
Triggering ADAM-prb ? 2.6.0,2.11,1.5.2,centos
Triggering ADAM-prb ? 2.6.0,2.10,1.5.2,centos
Touchstone configurations resulted in FAILURE, so aborting...
Notifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'

@fnothaft (Member) commented

OK, cool! By any chance, is the data on the cluster? If so, perhaps let's start a thread and we can work offline to debug it there as well.

@jpdna jpdna (Member, Author) commented Oct 12, 2016

Not on the cluster, but I provide a script so you can try to repro the gVCF join I was attempting, in this gist I just made:
https://gist.github.com/jpdna/a352ab9304a1885d01d3ac1c65dc77a8

which has links to the tiny input files here:
https://drive.google.com/drive/folders/0B6jh69UgixwpdDlGUkhRaW42QzA?usp=sharing

You want to start an email thread to discuss offline?

@fnothaft (Member) commented

> You want to start an email thread to discuss offline?

If the files are public then we don't need an offline thread; I'd assumed they weren't public. I'll take a look tomorrow.

@jpdna jpdna (Member, Author) commented Oct 12, 2016

I think I found the solution. It seems that the partition size needs to be more like 5000000, since it is in fact the bin size in nucleotides.

val result1 = InnerShuffleRegionJoin[VariantContext, VariantContext](x.sequences, 5000000, sc).partitionAndJoin(x_with_key, y_with_key)

now works for me and seems to give the join result I expected.

I started with a much lower number because the partition size used in the existing test at
https://github.com/bigdatagenomics/adam/blob/master/adam-core/src/test/scala/org/bdgenomics/adam/rdd/InnerShuffleRegionJoinSuite.scala#L26
is 3.

But for working with a whole chromosome or genome, a partition size of a million or more seems to make sense.
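As a rough back-of-the-envelope check (plain Python, not ADAM code, and assuming a simplified model where the join allocates one bin per partition-size window of each contig):

```python
# Hypothetical bin-count arithmetic: assumes the shuffle join covers each
# contig with ceil(contig_length / partition_size) bins. This is a sketch
# of the scaling argument, not ADAM's actual implementation.

def num_bins(contig_length, partition_size):
    """Number of partition_size-sized bins needed to cover a contig."""
    return -(-contig_length // partition_size)  # ceiling division

chr1 = 249_250_621  # approximate length of human chr1 (GRCh37)

# With the test suite's partition size of 3, the bin count explodes:
print(num_bins(chr1, 3))          # tens of millions of bins for one contig
# With 5,000,000 nucleotides per bin it stays tiny:
print(num_bins(chr1, 5_000_000))  # a few dozen bins
```

So a tiny partition size that works for a toy contig in a unit test blows up into tens of millions of bins on a real chromosome, which would explain running out of memory on only a handful of variants.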

I'll close this PR shortly.

@fnothaft (Member) commented

Oh, nice! Good catch. Perhaps we can beef up the documentation?

@jpdna jpdna (Member, Author) commented Oct 12, 2016

> Perhaps we can beef up the documentation?

Yup, a tiny PR for that doc is just in, thanks!

@jpdna jpdna closed this Oct 12, 2016