
Revert use of NIO-based SAMFileMerger from Hadoop-BAM #2306

Closed
wants to merge 2 commits

Conversation

@tomwhite (Contributor) commented Dec 9, 2016

See #2287

final String outputParentDir = outputFile.substring(0, outputFile.lastIndexOf('/') + 1);
// First, check for the _SUCCESS file.
final String successFile = outputFile + "/_SUCCESS";
final Path successPath = new Path(successFile);
Review comment (Contributor):

Perhaps we should use a fully-qualified class name when using Hadoop's Path (org.apache.hadoop.fs.Path), to avoid potential confusion with NIO Path?

Reply (Contributor Author):

I'm just reverting the previous change here.

Reply (Contributor):

Ok, fair enough
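For illustration, a minimal sketch of the naming suggestion above, assuming hadoop-common is on the classpath; the class and variable names here are hypothetical and not from this PR:

import java.nio.file.Path;
import java.nio.file.Paths;

public class PathDisambiguationSketch {
    public static void main(String[] args) {
        // The NIO Path keeps the plain import...
        Path nioPath = Paths.get("/tmp/output.bam");
        // ...while Hadoop's Path is referenced by its fully-qualified name to avoid the clash.
        org.apache.hadoop.fs.Path hadoopPath =
                new org.apache.hadoop.fs.Path("gs://bucket/output.bam");
        System.out.println(nioPath + " vs " + hadoopPath);
    }
}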

@lbergelson (Member):

I ran the test case from #2287 using this branch and I get

./gatk-launch ApplyBQSRSpark \
    -I gs://hellbender/test/resources/benchmark/CEUTrio.HiSeq.WEx.b37.NA12892.bam \
    -R gs://gatk-legacy-bundles/b37/human_g1k_v37.2bit \
    -O gs://hellbender/test/output/gatk4-spark/recalibrated.bam \
    -bqsr gs://gatk-demo/TEST/gatk4-spark/recalibration.table \
    -apiKey $HELLBENDER_TEST_APIKEY \
    -- \
    --sparkRunner GCS \
    --cluster methods-test-cluster \
    --executor-cores 4 \
    --executor-memory 20g
org.broadinstitute.hellbender.exceptions.GATKException: unable to write bam: org.apache.hadoop.fs.FileAlreadyExistsException: A directory with that name exists: gs://hellbender/test/output/gatk4-spark/recalibrated.bam
	at org.broadinstitute.hellbender.engine.spark.GATKSparkTool.writeReads(GATKSparkTool.java:253)
	at org.broadinstitute.hellbender.tools.spark.ApplyBQSRSpark.runTool(ApplyBQSRSpark.java:49)
	at org.broadinstitute.hellbender.engine.spark.GATKSparkTool.runPipeline(GATKSparkTool.java:349)
	at org.broadinstitute.hellbender.engine.spark.SparkCommandLineProgram.doWork(SparkCommandLineProgram.java:38)
	at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:112)
	at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:170)
	at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:189)
	at org.broadinstitute.hellbender.Main.instanceMain(Main.java:96)
	at org.broadinstitute.hellbender.Main.instanceMain(Main.java:103)
	at org.broadinstitute.hellbender.Main.mainEntry(Main.java:116)
	at org.broadinstitute.hellbender.Main.main(Main.java:158)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:736)
	at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:185)
	at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:210)
	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:124)
	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

@lbergelson lbergelson closed this Dec 13, 2016
@droazen (Contributor) commented Dec 13, 2016

Re-opening this one so that we have a place to discuss what needs to be done to get this working.

@droazen droazen reopened this Dec 13, 2016
@droazen droazen changed the title Revert use of NIO-based SAMFileMerger from Hadoop-BAM Revert use of NIO-based SAMFileMerger from Hadoop-BAM (DO NOT MERGE) Dec 13, 2016
@codecov-io commented Dec 13, 2016

Current coverage is 75.760% (diff: 89.744%)

Merging #2306 into master will increase coverage by 0.058%

@@             master      #2306   diff @@
==========================================
  Files           728        729     +1   
  Lines         38451      38622   +171   
  Methods           0          0          
  Messages          0          0          
  Branches       8027       8073    +46   
==========================================
+ Hits          29108      29260   +152   
- Misses         6840       6847     +7   
- Partials       2503       2515    +12   


Diff Coverage  File Path
89%            ...lbender/engine/spark/datasources/ReadsSparkSink.java

Powered by Codecov. Last update ffc26bb...3381f1c

@lbergelson (Member):

@tomwhite Ack, sorry, meant to just comment, not close and comment.

@tomwhite (Contributor Author):

@lbergelson that seems to be a separate bug, since this just reverts some commits. There's obviously a simple workaround here too.

@droazen (Contributor) commented Dec 14, 2016

How should we proceed? Should we try to add a new commit on this branch to fix the issue Louis ran into?

@droazen droazen changed the title Revert use of NIO-based SAMFileMerger from Hadoop-BAM (DO NOT MERGE) Revert use of NIO-based SAMFileMerger from Hadoop-BAM Dec 14, 2016
@tomwhite (Contributor Author):

This was meant to be a quick fix, so I would commit this as-is; the overwriting issue can be looked at later.

@droazen (Contributor) commented Dec 14, 2016

@lbergelson do you agree?

@lbergelson (Member):

I'm fine with merging this. Is the workaround for now that you have to run with a directory instead of a bam file path? It looks like it doesn't fix Geraldine's use case because of the silly overwriting issue.

@lbergelson (Member):

@tomwhite I'm curious if the reason this is working for you but not for me is because you're testing with a file that fits into a single partition.

@tomwhite (Contributor Author):

My mistake. I thought it was because of a file or directory from a previous run. If not, then it could be a difference in the way that GCS handles overwriting files.

@lbergelson (Member):

@tomwhite Ah, I see the confusion. I made sure the file didn't exist before running. I think it's because we write the part files into a temporary directory with the same name as the ultimate output, and then copy the final combined file over as a file with the same name (unless I'm misremembering how it works).
I'm guessing the issue is something to do with how the GCS filesystem treats deleting a directory. GCS doesn't have any real concept of directories as discrete entities, so there may be something funny about deleting one; we might have to explicitly delete the files in the directory instead of trying to delete the directory itself.
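A rough sketch of the "delete each file, then the directory" idea using the Hadoop FileSystem API; the class and method names are illustrative and not part of this PR:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ExplicitDirectoryCleanupSketch {
    // Delete the children of a GCS "directory" one by one before removing the directory entry,
    // instead of relying on a single recursive delete of the directory itself.
    public static void deleteContentsThenDirectory(String dir, Configuration conf) throws IOException {
        Path dirPath = new Path(dir);
        FileSystem fs = dirPath.getFileSystem(conf);
        if (!fs.exists(dirPath)) {
            return;
        }
        for (FileStatus child : fs.listStatus(dirPath)) {
            fs.delete(child.getPath(), true); // remove each child object (GCS models directories as key prefixes)
        }
        fs.delete(dirPath, false); // finally remove the now-empty directory placeholder, non-recursively
    }
}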

@tomwhite (Contributor Author):

@lbergelson So it looks like this never worked on GCS? At this point it might be best to get the code in Hadoop-BAM working with GCS (since that's what we'd prefer to use long-term), rather than patching the code being reinstated by this PR to work with GCS.

@lbergelson (Member):

We could write files to gs:// addresses at some point, though. I'm not sure what's going on that makes this not work now when it somehow did work in the past.

@tomwhite (Contributor Author) commented Jan 5, 2017

Really we need some tests for gs:// files in ReadsSparkSinkUnitTest, e.g. a GCS version of testWritingToFileURL. This needs knowledge of how to configure the Hadoop GCS connector (outside Dataproc), which I lack. Perhaps someone else knows how to do this?
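For reference, a sketch of how such a test might wire up the GCS connector outside Dataproc; this is an assumption about the setup, not code from this PR, and the property values (project id, keyfile path) are placeholders:

import org.apache.hadoop.conf.Configuration;

public class GcsConnectorTestConfigSketch {
    public static Configuration gcsConfiguration() {
        Configuration conf = new Configuration();
        // Register the GCS connector (the gcs-connector jar must be on the test classpath).
        conf.set("fs.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem");
        conf.set("fs.AbstractFileSystem.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS");
        conf.set("fs.gs.project.id", "my-gcp-project"); // placeholder project id
        // Authenticate with a service account key instead of Dataproc's machine credentials.
        conf.set("google.cloud.auth.service.account.enable", "true");
        conf.set("google.cloud.auth.service.account.json.keyfile", "/path/to/service-account.json"); // placeholder
        return conf;
    }
}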

@lbergelson (Member):

Interesting, also seeing:

org.broadinstitute.hellbender.exceptions.GATKException: unable to write bam: gs://hellbender/test/output/gatk4-spark/recalibrated.bam
	at org.broadinstitute.hellbender.engine.spark.GATKSparkTool.writeReads(GATKSparkTool.java:253)
	at org.broadinstitute.hellbender.tools.spark.ApplyBQSRSpark.runTool(ApplyBQSRSpark.java:49)
	at org.broadinstitute.hellbender.engine.spark.GATKSparkTool.runPipeline(GATKSparkTool.java:349)
	at org.broadinstitute.hellbender.engine.spark.SparkCommandLineProgram.doWork(SparkCommandLineProgram.java:38)
	at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:112)
	at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:170)
	at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:189)
	at org.broadinstitute.hellbender.Main.instanceMain(Main.java:96)
	at org.broadinstitute.hellbender.Main.instanceMain(Main.java:103)
	at org.broadinstitute.hellbender.Main.mainEntry(Main.java:116)
	at org.broadinstitute.hellbender.Main.main(Main.java:158)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:736)
	at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:185)
	at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:210)
	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:124)
	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.io.IOException: Existing mirrorFile and resourceId don't match isDirectory status! '/hadoop_gcs_connector_metadata_cache/hellbender/test/output/gatk4-spark/recalibrated.bam' (dir: 'false') vs 'gs://hellbender/test/output/gatk4-spark/recalibrated.bam/' (dir: 'true')
	at com.google.cloud.hadoop.gcsio.FileSystemBackedDirectoryListCache.getCacheEntryInternal(FileSystemBackedDirectoryListCache.java:198)
	at com.google.cloud.hadoop.gcsio.FileSystemBackedDirectoryListCache.putResourceId(FileSystemBackedDirectoryListCache.java:363)
	at com.google.cloud.hadoop.gcsio.CacheSupplementedGoogleCloudStorage.createEmptyObjects(CacheSupplementedGoogleCloudStorage.java:150)
	at com.google.cloud.hadoop.gcsio.GoogleCloudStorageFileSystem.mkdirs(GoogleCloudStorageFileSystem.java:578)
	at com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase.mkdirs(GoogleHadoopFileSystemBase.java:1372)
	at org.apache.hadoop.fs.FileSystem.mkdirs(FileSystem.java:1881)
	at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.setupJob(FileOutputCommitter.java:313)
	at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply$mcV$sp(PairRDDFunctions.scala:1150)
	at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply(PairRDDFunctions.scala:1078)
	at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply(PairRDDFunctions.scala:1078)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
	at org.apache.spark.rdd.RDD.withScope(RDD.scala:358)
	at org.apache.spark.rdd.PairRDDFunctions.saveAsNewAPIHadoopDataset(PairRDDFunctions.scala:1078)
	at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopFile$2.apply$mcV$sp(PairRDDFunctions.scala:998)
	at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopFile$2.apply(PairRDDFunctions.scala:989)
	at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopFile$2.apply(PairRDDFunctions.scala:989)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
	at org.apache.spark.rdd.RDD.withScope(RDD.scala:358)
	at org.apache.spark.rdd.PairRDDFunctions.saveAsNewAPIHadoopFile(PairRDDFunctions.scala:989)
	at org.apache.spark.api.java.JavaPairRDD.saveAsNewAPIHadoopFile(JavaPairRDD.scala:811)
	at org.broadinstitute.hellbender.engine.spark.datasources.ReadsSparkSink.saveAsShardedHadoopFiles(ReadsSparkSink.java:216)
	at org.broadinstitute.hellbender.engine.spark.datasources.ReadsSparkSink.writeReadsSingle(ReadsSparkSink.java:242)
	at org.broadinstitute.hellbender.engine.spark.datasources.ReadsSparkSink.writeReads(ReadsSparkSink.java:166)
	at org.broadinstitute.hellbender.engine.spark.GATKSparkTool.writeReads(GATKSparkTool.java:248)

for the same input.

@lbergelson (Member):

Possibly related to these warnings:

17/01/09 21:57:53 INFO com.google.cloud.genomics.dataflow.readers.bam.BAMIO: getReadsFromBAMFile - got input resources
17/01/09 21:57:54 INFO org.apache.hadoop.mapreduce.lib.input.FileInputFormat: Total input paths to process : 1
17/01/09 22:01:44 WARN com.google.cloud.hadoop.gcsio.FileSystemBackedDirectoryListCache: Got null fileList for listBaseFile '/hadoop_gcs_connector_metadata_cache/hellbender/test/output/gatk4-spark/recalibrated.bam' even though exists() was true!
17/01/09 22:01:45 INFO com.google.cloud.hadoop.gcsio.CacheSupplementedGoogleCloudStorage: Populating missing itemInfo on-demand for entry: gs://hellbender/test/output/gatk4-spark/recalibrated.bam
17/01/09 22:01:45 WARN com.google.cloud.hadoop.gcsio.CacheSupplementedGoogleCloudStorage: Possible stale CacheEntry; failed to fetch item info for: gs://hellbender/test/output/gatk4-spark/recalibrated.bam - removing from cache

@lbergelson (Member):

and sometimes I get:

java.net.SocketTimeoutException: Read timed out
	at java.net.SocketInputStream.socketRead0(Native Method)
	at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
	at java.net.SocketInputStream.read(SocketInputStream.java:170)
	at java.net.SocketInputStream.read(SocketInputStream.java:141)
	at sun.security.ssl.InputRecord.readFully(InputRecord.java:465)
	at sun.security.ssl.InputRecord.read(InputRecord.java:503)
	at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:973)
	at sun.security.ssl.SSLSocketImpl.readDataRecord(SSLSocketImpl.java:930)
	at sun.security.ssl.AppInputStream.read(AppInputStream.java:105)
	at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
	at java.io.BufferedInputStream.read1(BufferedInputStream.java:286)
	at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
	at sun.net.www.http.HttpClient.parseHTTPHeader(HttpClient.java:704)
	at sun.net.www.http.HttpClient.parseHTTP(HttpClient.java:647)
	at sun.net.www.protocol.http.HttpURLConnection.getInputStream0(HttpURLConnection.java:1569)
	at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1474)
	at java.net.HttpURLConnection.getResponseCode(HttpURLConnection.java:480)
	at sun.net.www.protocol.https.HttpsURLConnectionImpl.getResponseCode(HttpsURLConnectionImpl.java:3

@droazen droazen requested a review from lbergelson January 10, 2017 20:19
@droazen droazen assigned lbergelson and unassigned droazen Jan 10, 2017
@droazen (Contributor) commented Jan 10, 2017

It's looking like we might have to fix the issues with NIO here after all @tomwhite @jean-philippe-martin, as @lbergelson has been unable to get this working reasonably with the GCS adapter (it runs, but veeeerrryyy slowly).

@droazen (Contributor) commented Jan 17, 2017

Closing in favor of an NIO-based fix.

@droazen droazen closed this Jan 17, 2017