
BAMHeader not set when running on a cluster #676

Closed
arahuja opened this issue May 26, 2015 · 13 comments · Fixed by #691
@arahuja (Contributor) commented May 26, 2015

When running `transform` with a .bam file as the output, I see the following error:

Exception in thread "main" java.lang.AssertionError: assertion failed: Cannot return header if not attached.
        at scala.Predef$.assert(Predef.scala:179)
        at org.bdgenomics.adam.rdd.read.ADAMBAMOutputFormat$.getHeader(ADAMBAMOutputFormat.scala:60)
        at org.bdgenomics.adam.rdd.read.ADAMBAMOutputFormat.<init>(ADAMBAMOutputFormat.scala:68)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
        at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
        at java.lang.Class.newInstance(Class.java:379)
        at org.apache.spark.rdd.InstrumentedOutputFormat.<init>(InstrumentedOutputFormat.scala:33)
        at org.bdgenomics.adam.rdd.read.InstrumentedADAMBAMOutputFormat.<init>(ADAMBAMOutputFormat.scala:71)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
        at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
        at java.lang.Class.newInstance(Class.java:379)
        at org.apache.spark.rdd.PairRDDFunctions.saveAsNewAPIHadoopDataset(PairRDDFunctions.scala:958)

I tried to fix this with a strategy similar to the one @fnothaft used for the VCF header (see https://github.com/bigdatagenomics/adam/compare/master...hammerlab:bam-save-header?expand=1), but it did not seem to take. Any ideas on this?
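For reference, the header-attaching strategy being ported looks roughly like this (a minimal sketch, not the branch's actual code: `attachHeaderEverywhere` is a hypothetical name, while `ADAMBAMOutputFormat.addHeader` is the real companion-object method that appears in the traces in this thread):

    import htsjdk.samtools.SAMFileHeader
    import org.apache.spark.SparkContext
    import org.bdgenomics.adam.rdd.read.ADAMBAMOutputFormat

    // Sketch: broadcast the header from the driver, then run a throwaway job
    // so each executor JVM deserializes it and stores it in the output
    // format's static state before the real save job runs.
    def attachHeaderEverywhere(sc: SparkContext, header: SAMFileHeader): Unit = {
      val bcastHeader = sc.broadcast(header)
      sc.parallelize(0 until 4 * sc.defaultParallelism)
        .foreachPartition { _ =>
          ADAMBAMOutputFormat.addHeader(bcastHeader.value)
        }
      // Caveat: nothing guarantees that every executor runs at least one of
      // these tasks, which is exactly the failure mode discussed below.
    }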

@ryan-williams (Member)

IIRC, about a year ago @fnothaft and I could not convince ourselves that the VCF header impl actually worked; he said it was working for him on a cluster, but I couldn't get it to work.

@ryan-williams (Member)

I just ran into this issue as well, btw.

@ryan-williams (Member)

#353 has some discussion of the VCF version of this issue

@ryan-williams (Member)

I'll try my hand at porting a similar fix today and see if I have any more luck

@ryan-williams (Member)

Just tried with master...ryan-williams:sam-attempt, still see the issue. Will dig through logs and see if I can verify whether the executors attempted to set the header.

@ryan-williams (Member)

heh, seems like the errors there were from double- and triple-setting the header on each executor (once per partition), e.g.:

java.lang.AssertionError: assertion failed: Cannot attach a new SAM header without first clearing the header.
        at scala.Predef$.assert(Predef.scala:179)
        at org.bdgenomics.adam.rdd.read.ADAMSAMOutputFormat$.addHeader(ADAMSAMOutputFormat.scala:41)
        at org.bdgenomics.adam.rdd.read.AlignmentRecordRDDFunctions$$anonfun$adamSAMSave$1$$anonfun$2.apply(AlignmentRecordRDDFunctions.scala:144)
        at org.bdgenomics.adam.rdd.read.AlignmentRecordRDDFunctions$$anonfun$adamSAMSave$1$$anonfun$2.apply(AlignmentRecordRDDFunctions.scala:134)

Forcibly clearing and then re-setting the header on each partition seems to do the trick, e.g.

    rdd.mapPartitionsWithIndex((idx, iter) => {
      log.warn(s"Setting ${if (asSam) "SAM" else "BAM"} header for partition $idx")
      val header = bcastHeader.value
      synchronized {
        // Perform a map-partitions pass to ensure that the SAM/BAM header is
        // set on all nodes in the cluster (same strategy as the VCF header);
        // see: https://github.com/bigdatagenomics/adam/issues/353
        if (asSam) {
          ADAMSAMOutputFormat.clearHeader()
          ADAMSAMOutputFormat.addHeader(header)
          log.warn(s"Set SAM header for partition $idx")
        } else {
          ADAMBAMOutputFormat.clearHeader()
          ADAMBAMOutputFormat.addHeader(header)
          log.warn(s"Set BAM header for partition $idx")
        }
      }
      Iterator[Int]()
    }).count() // count() forces the job to run, so the side effect executes on each partition

I'll open up a PR promptly.

ryan-williams added commits to ryan-williams/adam that referenced this issue Jun 1, 2015:

  fixes bigdatagenomics#676

  - formatting / unnecessary `.toString` nits
  - one more header fix
  - whitespace nits
@ryan-williams (Member)

Interestingly, I just saw a failure when writing a SAM file, so there might be some nondeterministic way that this remains unfixed.

Here are the YARN container logs of all my executors.

This gist has the logs of the two executors where relevant things happened: IDs 161 and 69. The only log lines about headers come from these two executors, and they were running on the same host (out of 94 in the application); probably a coincidence?

I ran `rdd.coalesce(1).adamSAMSave(filename, asSam = true, asRegularFile = true)` on an ADAM with #689 patched in.

Timeline: (details not preserved in this copy)

Something for us to keep an eye on / think about possible explanations for!

@fnothaft (Member) commented Jun 4, 2015

I'm not 100% sure (obviously) but my money is on Hadoop-BAM being borked.

@ryan-williams (Member)

I thought I was seeing this again but decided the behavior is different enough that I'm going to file a new issue.

@ryan-williams (Member)

Bad news: I ran into this again today.
Good news: I've figured out why it's (still!) happening!
Bad news: I don't know how to fix it.

In the no-op mapPartitions+count that we run to set the header on each executor, there is no guarantee that at least one task will be executed on each active executor. Furthermore, there is no guarantee that tasks will run on the same executors during the subsequent writing of the BAM.

Generally, tasks are placed to optimize locality, which is likely to be stable in some cases. Today, though, I ran into trouble: the no-op count finishes in < 1s, while Spark waits on some (>1s?) timeout before scheduling non-NODE_LOCAL tasks. That timeout came into play in the later (saveAsNewAPIHadoopFile) stage, so certain executors received tasks in the later stage but not the earlier one, resulting in quick task failures and then job failure.
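(For reference: the timeout in question is Spark's `spark.locality.wait`, which defaults to 3s. A hedged sketch of one blunt mitigation, not the eventual fix: lowering the wait makes both stages hand tasks to any free executor immediately, making it more likely, though still not guaranteed, that the header-setting and save stages touch the same executors.)

    import org.apache.spark.SparkConf

    // Workaround sketch only: with no locality wait, the scheduler will not
    // hold tasks back for NODE_LOCAL slots, so the no-op header-setting stage
    // can spill onto the same executors the save stage will later use. This
    // papers over the race rather than removing it.
    val conf = new SparkConf()
      .set("spark.locality.wait", "0s")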

Following are more details about my app, partitions, and executors that exposed this.

I was attempting to load a .bam that is stored on HDFS in 956 blocks spanning 91 nodes. My cluster has 94 nodes; two were offline when this BAM was stored on HDFS, and one's HDFS process is dead but it functions fine as a Spark node.

At the time of the no-op-count stage, I had 319 live (3-core) executors, 4 of which were on 2 of the nodes with no BAM blocks. The no-op-count ran its 956 tasks in < 1s, with NODE_LOCAL locality on all tasks.

Then, the saveAsNewAPIHadoopFile stage ran, scheduling 928 NODE_LOCAL tasks, close to the maximum possible of 945 (3 on each of the 315 executors with BAM blocks, minus 17 executors that only got 2 tasks). 3-4s later, it scheduled the remaining 28 (956 - 928) tasks with RACK_LOCAL locality, on some of the executors that had only gotten 2 NODE_LOCAL tasks previously as well as on the 4 executors on nodes with no BAM blocks.

The latter 4 executors failed immediately due to not having the BAM header set, and then retried the failed tasks in such rapid succession that they hit my spark.task.maxFailures of 10 in under 1s and failed the job.

I haven't fully digested the implications of this or thought much about ways to fix it, but I'm going to reopen this as I think it's real and still live!

@ryan-williams reopened this May 16, 2016
@fnothaft (Member)

I think we have a 100% concrete, guaranteed fix in the single-file path. Specifically, see 7d4b409. It's an ugly fix: we write the header to HDFS as a file, and then read it when creating a record reader. It should be straightforward to port over to the "normal" (write a BAM as many partitions) code path; I just haven't done it.
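A minimal sketch of that approach, assuming htsjdk's text header codec and hypothetical helper names (the real implementation is commit 7d4b409): the driver writes the header to a sidecar file on HDFS, and the task that needs the header reads it back rather than depending on static JVM state.

    import java.io.OutputStreamWriter

    import htsjdk.samtools.{ SAMFileHeader, SAMTextHeaderCodec }
    import htsjdk.samtools.util.BufferedLineReader
    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{ FileSystem, Path }

    // Driver side: write the header to a sidecar file next to the output.
    def writeHeader(header: SAMFileHeader, path: Path, conf: Configuration): Unit = {
      val fs = FileSystem.get(path.toUri, conf)
      val out = new OutputStreamWriter(fs.create(path))
      new SAMTextHeaderCodec().encode(out, header)
      out.close()
    }

    // Task side: re-read the header from HDFS on whichever executor the task
    // lands on, instead of relying on a static field that a previous stage
    // may or may not have set on this particular JVM.
    def readHeader(path: Path, conf: Configuration): SAMFileHeader = {
      val fs = FileSystem.get(path.toUri, conf)
      val in = new BufferedLineReader(fs.open(path))
      try {
        new SAMTextHeaderCodec().decode(in, path.toString)
      } finally {
        in.close()
      }
    }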

@fnothaft (Member)

Resolved by #964.

@fnothaft self-assigned this Aug 8, 2016
@fnothaft reopened this Aug 8, 2016
@fnothaft (Member) commented Aug 8, 2016

This isn't yet resolved for sharded files; I'll have a PR resolving it for sharded files this AM.

fnothaft added commits to fnothaft/adam that referenced this issue on Aug 8, Aug 30, and Sep 6, 2016:

  Resolves bigdatagenomics#676. In bigdatagenomics#964, we resolved the "header not set"
  issues for single-file SAM/BAM output. This change propagates the fix to sharded
  SAM/BAM output, and to VCF.
@heuermh closed this as completed in 48caa3f Sep 7, 2016