Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Possible regression in merging bams on GCS in Spark #2287

Closed
droazen opened this issue Nov 30, 2016 · 21 comments
Closed

Possible regression in merging bams on GCS in Spark #2287

droazen opened this issue Nov 30, 2016 · 21 comments

Comments

@droazen
Copy link
Contributor

droazen commented Nov 30, 2016

@vdauwera reports that the following ApplyBQSRSpark command fails with an NIO-related error on dataproc:

time ./gatk-launch ApplyBQSRSpark \
    -I gs://hellbender/test/resources/benchmark/CEUTrio.HiSeq.WEx.b37.NA12892.bam \
    -R gs://gatk-legacy-bundles/b37/human_g1k_v37.2bit \
    -O gs://gatk-demo/TEST/gatk4-spark/recalibrated.bam \
    -bqsr gs://gatk-demo/TEST/gatk4-spark/recalibration.table \
    -apiKey $APIKEY_ORTMP \
    -- \
    --sparkRunner GCS \
    --cluster gvda-test-bqsr \
    --num-executors 40 \
    --executor-cores 4 \
    --executor-memory 20g
java.nio.file.FileSystemNotFoundException: Provider "gs" not installed
    at java.nio.file.Paths.get(Paths.java:147)
    at org.seqdoop.hadoop_bam.util.NIOFileUtil.asPath(NIOFileUtil.java:30)
    at org.seqdoop.hadoop_bam.util.SAMFileMerger.mergeParts(SAMFileMerger.java:51)
    at org.broadinstitute.hellbender.engine.spark.datasources.ReadsSparkSink.writeReadsSingle(ReadsSparkSink.java:231)
    at org.broadinstitute.hellbender.engine.spark.datasources.ReadsSparkSink.writeReads(ReadsSparkSink.java:153)
    at org.broadinstitute.hellbender.engine.spark.GATKSparkTool.writeReads(GATKSparkTool.java:247)
    at org.broadinstitute.hellbender.tools.spark.ApplyBQSRSpark.runTool(ApplyBQSRSpark.java:49)
    at org.broadinstitute.hellbender.engine.spark.GATKSparkTool.runPipeline(GATKSparkTool.java:348)
    at org.broadinstitute.hellbender.engine.spark.SparkCommandLineProgram.doWork(SparkCommandLineProgram.java:38)
    at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:109)
    at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:167)
    at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:186)
    at org.broadinstitute.hellbender.Main.instanceMain(Main.java:76)
    at org.broadinstitute.hellbender.Main.main(Main.java:92)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731)
    at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
    at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
16/11/29 23:44:27 INFO akka.remote.RemoteActorRefProvider$RemotingTerminator: Shutting down remote daemon.
16/11/29 23:44:27 INFO akka.remote.RemoteActorRefProvider$RemotingTerminator: Remote daemon shut down; proceeding with flushing remote transports.
ERROR: (gcloud.dataproc.jobs.submit.spark) Job [155c3731-4071-4aa9-bed8-f0eb3426b805] entered state [ERROR] while waiting for [DONE].
@droazen
Copy link
Contributor Author

droazen commented Nov 30, 2016

One possible culprit is HadoopGenomics/Hadoop-BAM#111, which introduced the NIOFileUtil class to Hadoop-BAM.

@tomwhite @jean-philippe-martin Could you comment with your thoughts on this?

@jean-philippe-martin
Copy link
Contributor

Sorry, I won't be able to get to this until next week at the earliest.

@tomwhite
Copy link
Contributor

I'm not sure why the gs provider doesn't get installed (I think JP was seeing this before), but in any case there's a workaround in GATK for it: https://github.com/broadinstitute/gatk/blob/master/src/main/java/org/broadinstitute/hellbender/utils/gcs/BucketUtils.java#L380-L390

However, since the Hadoop-BAM code is doing the path lookups, it can't call that code directly (the dependency is the wrong way round). It would be best if we could fix the underlying problem of course, so that it gets picked up properly - I wonder if this can be done by fixing the service provider so it survives relocation (see http://maven.apache.org/plugins/maven-shade-plugin/examples/resource-transformers.html#ServicesResourceTransformer).

BTW I'm afraid I'm travelling this week, so I won't have time to look at it until next week either :(

@vdauwera
Copy link
Contributor

There's no rush on my end, FYI.

@droazen
Copy link
Contributor Author

droazen commented Dec 1, 2016

Next week is fine guys, no worries! :)

Does the GS NIO provider currently support writing @jean-philippe-martin? This particular method is trying to write a merged bam file to a bucket. I'm wondering if it would work even if the gs provider were present...

@jean-philippe-martin
Copy link
Contributor

@droazen yes, the GS NIO provider supports writing (though I've spent most of my focus on the reading side). You have to pick reading or writing when you open the file, you cannot do both on the same stream at the same time (I don't think that's the problem here).

What might be going on here is confusion between the NIO paths (they start with "gs://") and the GCS connector paths (they also start with "gs://"). Last time I checked the NIO paths were passed via a different input argument so we didn't confuse the two, but it's possible maybe some wires got crossed inside?

@tomwhite
Copy link
Contributor

tomwhite commented Dec 5, 2016

I checked in the spark JAR, and META-INF/services/java.nio.file.spi.FileSystemProvider contains

com.google.cloud.storage.contrib.nio.CloudStorageFileSystemProvider
hdfs.jsr203.HadoopFileSystemProvider

So it looks like the service loader file is being created correctly.

I tried to reproduce on GCS by running (essentially the same as @vdauwera's command.)

time ./gatk-launch ApplyBQSRSpark \
    -I gs://hellbender/test/resources/benchmark/CEUTrio.HiSeq.WEx.b37.NA12892.bam \
    -R gs://gatk-legacy-bundles/b37/human_g1k_v37.2bit \
    -O gs://gatk-demo-tom/TEST/gatk4-spark/recalibrated.bam \
    -bqsr gs://gatk-demo/TEST/gatk4-spark/recalibration.table \
    -apiKey $GOOGLE_APPLICATION_CREDENTIALS \
    -- \
    --sparkRunner GCS \
    --cluster cluster-tom \
    --num-executors 40 \
    --executor-cores 4 \
    --executor-memory 10g

But I got another error earlier on. Any ideas what this could be? (I can see the input bam with gsutil cp)

org.broadinstitute.hellbender.exceptions.UserException: A USER ERROR has occurred: Failed to read bam header from gs://hellbender/test/resources/benchmark/CEUTrio.HiSeq.WEx.b37.NA12892.bam
 Caused by:Error reading null at position 0
	at org.broadinstitute.hellbender.engine.spark.datasources.ReadsSparkSource.getHeader(ReadsSparkSource.java:182)
	at org.broadinstitute.hellbender.engine.spark.GATKSparkTool.initializeReads(GATKSparkTool.java:376)
	at org.broadinstitute.hellbender.engine.spark.GATKSparkTool.initializeToolInputs(GATKSparkTool.java:357)
	at org.broadinstitute.hellbender.engine.spark.GATKSparkTool.runPipeline(GATKSparkTool.java:347)
	at org.broadinstitute.hellbender.engine.spark.SparkCommandLineProgram.doWork(SparkCommandLineProgram.java:38)
	at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:109)
	at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:167)
	at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:186)
	at org.broadinstitute.hellbender.Main.instanceMain(Main.java:95)
	at org.broadinstitute.hellbender.Main.instanceMain(Main.java:102)
	at org.broadinstitute.hellbender.Main.mainEntry(Main.java:115)
	at org.broadinstitute.hellbender.Main.main(Main.java:157)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:736)
	at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:185)
	at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:210)
	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:124)
	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.io.IOException: Error reading null at position 0
	at com.google.cloud.genomics.dataflow.readers.bam.SeekableGCSStream.openStream(SeekableGCSStream.java:126)
	at com.google.cloud.genomics.dataflow.readers.bam.SeekableGCSStream.seek(SeekableGCSStream.java:103)
	at com.google.cloud.genomics.dataflow.readers.bam.SeekableGCSStream.<init>(SeekableGCSStream.java:59)
	at com.google.cloud.genomics.dataflow.readers.bam.BAMIO.openBAMFile(BAMIO.java:67)
	at com.google.cloud.genomics.dataflow.readers.bam.BAMIO.openBAM(BAMIO.java:51)
	at org.broadinstitute.hellbender.engine.spark.datasources.ReadsSparkSource.getHeader(ReadsSparkSource.java:178)
	... 20 more
Caused by: com.google.api.client.googleapis.json.GoogleJsonResponseException: 400 Bad Request
Bad Request
	at com.google.api.client.googleapis.json.GoogleJsonResponseException.from(GoogleJsonResponseException.java:146)
	at com.google.api.client.googleapis.services.json.AbstractGoogleJsonClientRequest.newExceptionOnError(AbstractGoogleJsonClientRequest.java:113)
	at com.google.api.client.googleapis.services.json.AbstractGoogleJsonClientRequest.newExceptionOnError(AbstractGoogleJsonClientRequest.java:40)
	at com.google.api.client.googleapis.services.AbstractGoogleClientRequest$1.interceptResponse(AbstractGoogleClientRequest.java:321)
	at com.google.api.client.http.HttpRequest.execute(HttpRequest.java:1056)
	at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:419)
	at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:352)
	at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeMedia(AbstractGoogleClientRequest.java:380)
	at com.google.api.services.storage.Storage$Objects$Get.executeMedia(Storage.java:4792)
	at com.google.cloud.genomics.dataflow.readers.bam.SeekableGCSStream.openStream(SeekableGCSStream.java:121)
	... 25 more
ERROR: (gcloud.dataproc.jobs.submit.spark) Job [06be217b-bc58-49b4-afde-3cd5ec16f18d] entered state [ERROR] while waiting for [DONE].

@tomwhite
Copy link
Contributor

tomwhite commented Dec 5, 2016

@droazen helped me out with this. It was because I wasn't specifying the API key correctly.

@jean-philippe-martin
Copy link
Contributor

Spark hiding the GCS-NIO jar is a problem we indeed ran into before (see IOUtils.getPath). Given that this never worked, I wonder how we were outputting merged bams to GCS before?

@droazen
Copy link
Contributor Author

droazen commented Dec 5, 2016

@jean-philippe-martin Possibly we were using the GCS<->HDFS adapter previously, and something changed in the code to make us use NIO here instead? (possibly HadoopGenomics/Hadoop-BAM#111?)

@jean-philippe-martin
Copy link
Contributor

I concur, what it looks like we have here is code that switched (accidentally?) from the GCS adapter to GCS-NIO instead.

@tomwhite
Copy link
Contributor

tomwhite commented Dec 6, 2016

Yes, Hadoop-BAM uses the NIO API to do file merging, whereas in GATK we were using the Hadoop APIs (and therefore the GCS<->HDFS adapter) to do it.

It looks like there are a couple of things needed in GCS-NIO to use the NIO API for this.

  1. [NIO] Make CloudStorageFileSystem#newFileSystem more lenient in the URIs it accepts googleapis/google-cloud-java#1450 so that we don't have to special-case gs URIs to remove everything except the scheme and host when looking up the filesystem (see https://github.com/HadoopGenomics/Hadoop-BAM/blob/master/src/main/java/org/seqdoop/hadoop_bam/util/NIOFileUtil.java#L40)
  2. NIO: Implement directory listing googleapis/google-cloud-java#813 to support path matching (https://github.com/HadoopGenomics/Hadoop-BAM/blob/master/src/main/java/org/seqdoop/hadoop_bam/util/NIOFileUtil.java#L90)

There may be more, as I stopped there. The best way forward is probably to go back to the old code in GATK while the deficiencies in GCS-NIO are fixed and then released.

The stacktrace I got for 1 was:

java.lang.IllegalArgumentException: GCS FileSystem URIs mustn't have: port, userinfo, path, query, or fragment: gs://gatk-demo-tom/TEST/markdups.parts/_SUCCESS
	at shaded.cloud-nio.com.google.common.base.Preconditions.checkArgument(Preconditions.java:146)
	at com.google.cloud.storage.contrib.nio.CloudStorageFileSystemProvider.newFileSystem(CloudStorageFileSystemProvider.java:192)
	at com.google.cloud.storage.contrib.nio.CloudStorageFileSystemProvider.newFileSystem(CloudStorageFileSystemProvider.java:83)
	at java.nio.file.FileSystems.newFileSystem(FileSystems.java:336)
	at org.seqdoop.hadoop_bam.util.NIOFileUtil.asPath(NIOFileUtil.java:40)
	at org.seqdoop.hadoop_bam.util.NIOFileUtil.asPath(NIOFileUtil.java:54)
	at org.seqdoop.hadoop_bam.util.SAMFileMerger.mergeParts(SAMFileMerger.java:51)
	at org.broadinstitute.hellbender.engine.spark.datasources.ReadsSparkSink.writeReadsSingle(ReadsSparkSink.java:230)

And for 2:

java.lang.UnsupportedOperationException
	at com.google.cloud.storage.contrib.nio.CloudStorageFileSystem.getPathMatcher(CloudStorageFileSystem.java:219)
	at org.seqdoop.hadoop_bam.util.NIOFileUtil.getFilesMatching(NIOFileUtil.java:94)
	at org.seqdoop.hadoop_bam.util.SAMFileMerger.mergeParts(SAMFileMerger.java:62)
	at org.broadinstitute.hellbender.engine.spark.datasources.ReadsSparkSink.writeReadsSingle(ReadsSparkSink.java:230)

The command I ran was:

time ./gatk-launch MarkDuplicatesSpark \
    -I gs://hellbender/test/resources/large/CEUTrio.HiSeq.WGS.b37.NA12878.20.21.bam \
    -O gs://gatk-demo-tom/TEST/markdups \
    --METRICS_FILE /tmp/markdups-metrics \
    -apiKey $APIKEY_ORTMP \
    -- \
    --sparkRunner GCS \
    --cluster cluster-tom \
    --num-executors 40 \
    --executor-cores 4 \
    --executor-memory 10g

@tomwhite
Copy link
Contributor

tomwhite commented Dec 6, 2016

It would be helpful to have a failing unit test for this that runs on files in GCS.

@droazen
Copy link
Contributor Author

droazen commented Dec 6, 2016

@tomwhite Can we get a quick fix here while we wait for the NIO improvements you suggested? Perhaps we could create a separate copy of the bam merger in GATK4 that uses the old code path, and use that copy for now instead of Hadoop-BAM's?

@tomwhite
Copy link
Contributor

tomwhite commented Dec 8, 2016

@droazen That's probably the best way to do it, although I don't think we can be sure of a quick fix until we have a failing unit test for this on GCS. The turnaround time for manual testing is slow (for me at the moment). If someone could write such a test that would be very helpful.

@droazen
Copy link
Contributor Author

droazen commented Dec 8, 2016

@tomwhite If you could put together a quick PR against GATK4 with a temporary copy of the bam merger that uses the old code path, I'd be happy to test it.

@tomwhite
Copy link
Contributor

tomwhite commented Dec 9, 2016

@droazen thanks. I've created #2306 which reverts the commits that caused the problem on GCS. I've tested the above command (MarkDuplicatesSpark) on GCS and it works fine.

@jean-philippe-martin
Copy link
Contributor

jean-philippe-martin commented Dec 9, 2016 via email

@tomwhite
Copy link
Contributor

@jean-philippe-martin thanks for fixing the two GCS problems. I've filed another issue to do with walking the filesystem here: googleapis/google-cloud-java#1519. /cc @droazen @lbergelson

@droazen droazen modified the milestones: beta, 4.0 release May 30, 2017
@droazen
Copy link
Contributor Author

droazen commented May 30, 2017

@tomwhite @jean-philippe-martin Is this one done? Can we close this?

@tomwhite
Copy link
Contributor

Yes

@droazen droazen closed this as completed May 31, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants