
BAMHeader not set when running on a cluster #676

Closed
arahuja opened this issue May 26, 2015 · 13 comments · Fixed by #691
@arahuja (Contributor) commented May 26, 2015

When running `transform` with a .bam file as the output, I see the following error:

Exception in thread "main" java.lang.AssertionError: assertion failed: Cannot return header if not attached.
        at scala.Predef$.assert(Predef.scala:179)
        at org.bdgenomics.adam.rdd.read.ADAMBAMOutputFormat$.getHeader(ADAMBAMOutputFormat.scala:60)
        at org.bdgenomics.adam.rdd.read.ADAMBAMOutputFormat.<init>(ADAMBAMOutputFormat.scala:68)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
        at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
        at java.lang.Class.newInstance(Class.java:379)
        at org.apache.spark.rdd.InstrumentedOutputFormat.<init>(InstrumentedOutputFormat.scala:33)
        at org.bdgenomics.adam.rdd.read.InstrumentedADAMBAMOutputFormat.<init>(ADAMBAMOutputFormat.scala:71)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
        at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
        at java.lang.Class.newInstance(Class.java:379)
        at org.apache.spark.rdd.PairRDDFunctions.saveAsNewAPIHadoopDataset(PairRDDFunctions.scala:958)

I tried to fix this with a strategy similar to the one @fnothaft used for the VCF header (see https://github.com/bigdatagenomics/adam/compare/master...hammerlab:bam-save-header?expand=1), but it did not seem to take. Any ideas on this?
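For reference, the header-attaching strategy being ported looks roughly like this (a minimal sketch, not the branch's actual code: `attachHeaderEverywhere` is a hypothetical name, while `ADAMBAMOutputFormat.addHeader` is the real companion-object method that appears in the traces in this thread):

    import htsjdk.samtools.SAMFileHeader
    import org.apache.spark.SparkContext
    import org.bdgenomics.adam.rdd.read.ADAMBAMOutputFormat

    // Sketch: broadcast the header from the driver, then run a throwaway job
    // so each executor JVM deserializes it and stores it in the output
    // format's static state before the real save job runs.
    def attachHeaderEverywhere(sc: SparkContext, header: SAMFileHeader): Unit = {
      val bcastHeader = sc.broadcast(header)
      sc.parallelize(0 until 4 * sc.defaultParallelism)
        .foreachPartition { _ =>
          ADAMBAMOutputFormat.addHeader(bcastHeader.value)
        }
      // Caveat: nothing guarantees that every executor runs at least one of
      // these tasks, which is exactly the failure mode discussed below.
    }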

@ryan-williams (Member)

IIRC, about a year ago @fnothaft and I could not convince ourselves that the VCF header impl actually worked; he said it was working for him on a cluster, but I couldn't get it to work.

@ryan-williams (Member)

I just ran into this issue as well, btw.

@ryan-williams (Member)

#353 has some discussion of the VCF version of this issue

@ryan-williams (Member)

I'll try my hand at porting a similar fix today and see if I have any more luck

@ryan-williams (Member)

Just tried with master...ryan-williams:sam-attempt, still see the issue. Will dig through logs and see if I can verify whether the executors attempted to set the header.

@ryan-williams (Member)

heh, seems like the errors there were from double- and triple-setting the header on each executor (once per partition), e.g.:

java.lang.AssertionError: assertion failed: Cannot attach a new SAM header without first clearing the header.
        at scala.Predef$.assert(Predef.scala:179)
        at org.bdgenomics.adam.rdd.read.ADAMSAMOutputFormat$.addHeader(ADAMSAMOutputFormat.scala:41)
        at org.bdgenomics.adam.rdd.read.AlignmentRecordRDDFunctions$$anonfun$adamSAMSave$1$$anonfun$2.apply(AlignmentRecordRDDFunctions.scala:144)
        at org.bdgenomics.adam.rdd.read.AlignmentRecordRDDFunctions$$anonfun$adamSAMSave$1$$anonfun$2.apply(AlignmentRecordRDDFunctions.scala:134)

Forcibly clearing and then re-setting the header on each partition seems to do the trick, e.g.

    rdd.mapPartitionsWithIndex((idx, iter) => {
      log.warn(s"Setting ${if (asSam) "SAM" else "BAM"} header for partition $idx")
      val header = bcastHeader.value
      synchronized {
        // Perform a map-partitions pass to ensure that the SAM/BAM header is
        // set on all nodes in the cluster (same strategy as the VCF header);
        // see: https://github.com/bigdatagenomics/adam/issues/353
        if (asSam) {
          ADAMSAMOutputFormat.clearHeader()
          ADAMSAMOutputFormat.addHeader(header)
          log.warn(s"Set SAM header for partition $idx")
        } else {
          ADAMBAMOutputFormat.clearHeader()
          ADAMBAMOutputFormat.addHeader(header)
          log.warn(s"Set BAM header for partition $idx")
        }
      }
      Iterator[Int]()
    }).count() // count() forces the job to run, so the side effect executes on each partition

I'll open up a PR promptly.

ryan-williams added commits to ryan-williams/adam that referenced this issue Jun 1, 2015:

  fixes bigdatagenomics#676

  - formatting / unnecessary `.toString` nits
  - one more header fix
  - whitespace nits
@ryan-williams (Member)

Interestingly, I just saw a failure when writing a SAM file, so there might be some nondeterministic way that this remains unfixed.

Here are the YARN container logs of all my executors.

This gist has the logs of the two executors where relevant things happened: IDs 161 and 69. The only log lines about headers come from these two executors, and they were running on the same host (out of 94 in the application); probably a coincidence?

I ran `rdd.coalesce(1).adamSAMSave(filename, asSam = true, asRegularFile = true)` on an ADAM with #689 patched in.

Timeline: (details not preserved in this copy)

Something for us to keep an eye on / think about possible explanations for!

@fnothaft (Member) commented Jun 4, 2015

I'm not 100% sure (obviously) but my money is on Hadoop-BAM being borked.

@ryan-williams (Member)

I thought I was seeing this again but decided the behavior is different enough that I'm going to file a new issue.

@ryan-williams (Member)

Bad news: I ran into this again today.
Good news: I've figured out why it's (still!) happening!
Bad news: I don't know how to fix it.

In the no-op mapPartitions+count that we run to set the header on each executor, there is no guarantee that at least one task will be executed on each active executor. Furthermore, there is no guarantee that tasks will run on the same executors during the subsequent writing of the BAM.

Generally, tasks are placed to optimize locality, which is likely to be stable in some cases. Today, though, I ran into trouble: the no-op count finishes in < 1s, while Spark waits on some (>1s?) timeout before scheduling non-NODE_LOCAL tasks. That timeout came into play in the later (saveAsNewAPIHadoopFile) stage, so certain executors received tasks in the later stage but not the earlier one, resulting in quick task failures and then job failure.
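(For reference: the timeout in question is Spark's `spark.locality.wait`, which defaults to 3s. A hedged sketch of one blunt mitigation, not the eventual fix: lowering the wait makes both stages hand tasks to any free executor immediately, making it more likely, though still not guaranteed, that the header-setting and save stages touch the same executors.)

    import org.apache.spark.SparkConf

    // Workaround sketch only: with no locality wait, the scheduler will not
    // hold tasks back for NODE_LOCAL slots, so the no-op header-setting stage
    // can spill onto the same executors the save stage will later use. This
    // papers over the race rather than removing it.
    val conf = new SparkConf()
      .set("spark.locality.wait", "0s")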

Following are more details about my app, partitions, and executors that exposed this.

I was attempting to load a .bam that is stored on HDFS in 956 blocks spanning 91 nodes. My cluster has 94 nodes; two were offline when this BAM was stored on HDFS, and one's HDFS process is dead but it functions fine as a Spark node.

At the time of the no-op-count stage, I had 319 live (3-core) executors, 4 of which were on 2 of the nodes with no BAM blocks. The no-op-count ran its 956 tasks in < 1s, with NODE_LOCAL locality on all tasks.

Then, the saveAsNewAPIHadoopFile stage ran, scheduling 928 NODE_LOCAL tasks, close to the maximum possible of 945 (3 on each of the 315 executors with BAM blocks, minus 17 executors that only got 2 tasks). 3-4s later, it scheduled the remaining 28 (956 - 928) tasks with RACK_LOCAL locality, on some of the executors that had only gotten 2 NODE_LOCAL tasks previously as well as on the 4 executors on nodes with no BAM blocks.

The latter 4 executors failed immediately due to not having the BAM header set, and then retried the failed tasks in such rapid succession that they hit my spark.task.maxFailures of 10 in under 1s and failed the job.

I haven't fully digested the implications of this or thought much about ways to fix it, but I'm going to reopen this as I think it's real and still live!

@ryan-williams reopened this May 16, 2016
@fnothaft (Member)

I think we have a 100% concrete, guaranteed fix in the single-file path. Specifically, see 7d4b409. It's an ugly fix: we write the header to HDFS as a file, and then read it when creating a record reader. It should be straightforward to port over to the "normal" (write a BAM as many partitions) code path; I just haven't done it.
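A minimal sketch of that approach, assuming htsjdk's text header codec and hypothetical helper names (the real implementation is commit 7d4b409): the driver writes the header to a sidecar file on HDFS, and the task that needs the header reads it back rather than depending on static JVM state.

    import java.io.OutputStreamWriter

    import htsjdk.samtools.{ SAMFileHeader, SAMTextHeaderCodec }
    import htsjdk.samtools.util.BufferedLineReader
    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{ FileSystem, Path }

    // Driver side: write the header to a sidecar file next to the output.
    def writeHeader(header: SAMFileHeader, path: Path, conf: Configuration): Unit = {
      val fs = FileSystem.get(path.toUri, conf)
      val out = new OutputStreamWriter(fs.create(path))
      new SAMTextHeaderCodec().encode(out, header)
      out.close()
    }

    // Task side: re-read the header from HDFS on whichever executor the task
    // lands on, instead of relying on a static field that a previous stage
    // may or may not have set on this particular JVM.
    def readHeader(path: Path, conf: Configuration): SAMFileHeader = {
      val fs = FileSystem.get(path.toUri, conf)
      val in = new BufferedLineReader(fs.open(path))
      try {
        new SAMTextHeaderCodec().decode(in, path.toString)
      } finally {
        in.close()
      }
    }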

@fnothaft (Member)

Resolved by #964.

@fnothaft self-assigned this Aug 8, 2016
@fnothaft reopened this Aug 8, 2016
@fnothaft (Member) commented Aug 8, 2016

This isn't yet resolved for sharded files; I'll have a PR resolving it for sharded files this AM.

fnothaft added commits to fnothaft/adam that referenced this issue on Aug 8, Aug 30, and Sep 6, 2016:

  Resolves bigdatagenomics#676. In bigdatagenomics#964, we resolved the "header not set"
  issues for single-file SAM/BAM output. This change propagates the fix to sharded
  SAM/BAM output, and to VCF.
@heuermh closed this as completed in 48caa3f Sep 7, 2016