
Revert use of NIO-based SAMFileMerger from Hadoop-BAM #2306

Closed
wants to merge 2 commits

Conversation

@tomwhite (Contributor) commented Dec 9, 2016

See #2287

final String outputParentDir = outputFile.substring(0, outputFile.lastIndexOf('/') + 1);
// First, check for the _SUCCESS file.
final String successFile = outputFile + "/_SUCCESS";
final Path successPath = new Path(successFile);
Review comment (Contributor):

Perhaps we should use a fully-qualified class name when using Hadoop's Path (org.apache.hadoop.fs.Path), to avoid potential confusion with NIO Path?

Reply (Contributor Author):

I'm just reverting the previous change here.

Reply (Contributor):

Ok, fair enough
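For illustration, a minimal sketch of the naming suggestion above, assuming hadoop-common is on the classpath; the class and variable names here are hypothetical and not from this PR:

import java.nio.file.Path;
import java.nio.file.Paths;

public class PathDisambiguationSketch {
    public static void main(String[] args) {
        // The NIO Path keeps the plain import...
        Path nioPath = Paths.get("/tmp/output.bam");
        // ...while Hadoop's Path is referenced by its fully-qualified name to avoid the clash.
        org.apache.hadoop.fs.Path hadoopPath =
                new org.apache.hadoop.fs.Path("gs://bucket/output.bam");
        System.out.println(nioPath + " vs " + hadoopPath);
    }
}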

@lbergelson (Member):

I ran the test case from #2287 using this branch and I get

./gatk-launch ApplyBQSRSpark \
    -I gs://hellbender/test/resources/benchmark/CEUTrio.HiSeq.WEx.b37.NA12892.bam \
    -R gs://gatk-legacy-bundles/b37/human_g1k_v37.2bit \
    -O gs://hellbender/test/output/gatk4-spark/recalibrated.bam \
    -bqsr gs://gatk-demo/TEST/gatk4-spark/recalibration.table \
    -apiKey $HELLBENDER_TEST_APIKEY \
    -- \
    --sparkRunner GCS \
    --cluster methods-test-cluster \
    --executor-cores 4 \
    --executor-memory 20g
org.broadinstitute.hellbender.exceptions.GATKException: unable to write bam: org.apache.hadoop.fs.FileAlreadyExistsException: A directory with that name exists: gs://hellbender/test/output/gatk4-spark/recalibrated.bam
	at org.broadinstitute.hellbender.engine.spark.GATKSparkTool.writeReads(GATKSparkTool.java:253)
	at org.broadinstitute.hellbender.tools.spark.ApplyBQSRSpark.runTool(ApplyBQSRSpark.java:49)
	at org.broadinstitute.hellbender.engine.spark.GATKSparkTool.runPipeline(GATKSparkTool.java:349)
	at org.broadinstitute.hellbender.engine.spark.SparkCommandLineProgram.doWork(SparkCommandLineProgram.java:38)
	at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:112)
	at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:170)
	at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:189)
	at org.broadinstitute.hellbender.Main.instanceMain(Main.java:96)
	at org.broadinstitute.hellbender.Main.instanceMain(Main.java:103)
	at org.broadinstitute.hellbender.Main.mainEntry(Main.java:116)
	at org.broadinstitute.hellbender.Main.main(Main.java:158)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:736)
	at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:185)
	at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:210)
	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:124)
	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

@lbergelson lbergelson closed this Dec 13, 2016
@droazen (Contributor) commented Dec 13, 2016

Re-opening this one so that we have a place to discuss what needs to be done to get this working.

@droazen droazen reopened this Dec 13, 2016
@droazen droazen changed the title Revert use of NIO-based SAMFileMerger from Hadoop-BAM Revert use of NIO-based SAMFileMerger from Hadoop-BAM (DO NOT MERGE) Dec 13, 2016
@codecov-io commented Dec 13, 2016

Current coverage is 75.760% (diff: 89.744%)

Merging #2306 into master will increase coverage by 0.058%

@@             master      #2306   diff @@
==========================================
  Files           728        729     +1   
  Lines         38451      38622   +171   
  Methods           0          0          
  Messages          0          0          
  Branches       8027       8073    +46   
==========================================
+ Hits          29108      29260   +152   
- Misses         6840       6847     +7   
- Partials       2503       2515    +12   


Diff Coverage  File Path
89%            ...lbender/engine/spark/datasources/ReadsSparkSink.java

Powered by Codecov. Last update ffc26bb...3381f1c

@lbergelson (Member):

@tomwhite Ack, sorry, meant to just comment, not close and comment.

@tomwhite (Contributor Author):

@lbergelson that seems to be a separate bug, since this just reverts some commits. There's obviously a simple workaround here too.

@droazen (Contributor) commented Dec 14, 2016

How should we proceed? Should we try to add a new commit on this branch to fix the issue Louis ran into?

@droazen droazen changed the title Revert use of NIO-based SAMFileMerger from Hadoop-BAM (DO NOT MERGE) Revert use of NIO-based SAMFileMerger from Hadoop-BAM Dec 14, 2016
@tomwhite (Contributor Author):

This was meant to be a quick fix, so I would commit this as-is; the overwriting issue can be looked at later.

@droazen (Contributor) commented Dec 14, 2016

@lbergelson do you agree?

@lbergelson (Member):

I'm fine with merging this. Is the workaround for now that you have to run with a directory instead of a bam file path? It looks like it doesn't fix Geraldine's use case because of the silly overwriting issue.

@lbergelson (Member):

@tomwhite I'm curious if the reason this is working for you but not for me is because you're testing with a file that fits into a single partition.

@tomwhite (Contributor Author):

My mistake. I thought it was because of a file or directory from a previous run. If not, then it could be a difference in the way that GCS handles overwriting files.

@lbergelson (Member):

@tomwhite Ah, I see the confusion. I made sure the file didn't exist before running. I think it's because we write the part files into a temporary directory with the same name as the ultimate output, and then copy the final combined file over as a file with the same name (unless I'm misremembering how it works).
I'm guessing the issue is something to do with how the GCS filesystem treats deleting a directory. GCS doesn't have any real concept of directories as discrete entities, so there may be something funny about deleting one; we might have to explicitly delete the files in the directory instead of trying to delete the directory itself.
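A rough sketch of the "delete each file, then the directory" idea using the Hadoop FileSystem API; the class and method names are illustrative and not part of this PR:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ExplicitDirectoryCleanupSketch {
    // Delete the children of a GCS "directory" one by one before removing the directory entry,
    // instead of relying on a single recursive delete of the directory itself.
    public static void deleteContentsThenDirectory(String dir, Configuration conf) throws IOException {
        Path dirPath = new Path(dir);
        FileSystem fs = dirPath.getFileSystem(conf);
        if (!fs.exists(dirPath)) {
            return;
        }
        for (FileStatus child : fs.listStatus(dirPath)) {
            fs.delete(child.getPath(), true); // remove each child object (GCS models directories as key prefixes)
        }
        fs.delete(dirPath, false); // finally remove the now-empty directory placeholder, non-recursively
    }
}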

@tomwhite (Contributor Author):

@lbergelson So it looks like this never worked on GCS? At this point it might be best to get the code in Hadoop-BAM working with GCS (since that's what we'd prefer to use long-term), rather than patching the code being reinstated by this PR to work with GCS.

@lbergelson (Member):

We could write files to gs:// addresses at some point, though. I'm not sure what's going on that makes this not work now when it somehow did work in the past.

@tomwhite (Contributor Author) commented Jan 5, 2017

Really we need some tests for gs:// files in ReadsSparkSinkUnitTest, e.g. a GCS version of testWritingToFileURL. This needs knowledge of how to configure the Hadoop GCS connector (outside Dataproc), which I lack. Perhaps someone else knows how to do this?
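For reference, a sketch of how such a test might wire up the GCS connector outside Dataproc; this is an assumption about the setup, not code from this PR, and the property values (project id, keyfile path) are placeholders:

import org.apache.hadoop.conf.Configuration;

public class GcsConnectorTestConfigSketch {
    public static Configuration gcsConfiguration() {
        Configuration conf = new Configuration();
        // Register the GCS connector (the gcs-connector jar must be on the test classpath).
        conf.set("fs.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem");
        conf.set("fs.AbstractFileSystem.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS");
        conf.set("fs.gs.project.id", "my-gcp-project"); // placeholder project id
        // Authenticate with a service account key instead of Dataproc's machine credentials.
        conf.set("google.cloud.auth.service.account.enable", "true");
        conf.set("google.cloud.auth.service.account.json.keyfile", "/path/to/service-account.json"); // placeholder
        return conf;
    }
}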

@lbergelson (Member):

Interesting, also seeing:

org.broadinstitute.hellbender.exceptions.GATKException: unable to write bam: gs://hellbender/test/output/gatk4-spark/recalibrated.bam
	at org.broadinstitute.hellbender.engine.spark.GATKSparkTool.writeReads(GATKSparkTool.java:253)
	at org.broadinstitute.hellbender.tools.spark.ApplyBQSRSpark.runTool(ApplyBQSRSpark.java:49)
	at org.broadinstitute.hellbender.engine.spark.GATKSparkTool.runPipeline(GATKSparkTool.java:349)
	at org.broadinstitute.hellbender.engine.spark.SparkCommandLineProgram.doWork(SparkCommandLineProgram.java:38)
	at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:112)
	at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:170)
	at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:189)
	at org.broadinstitute.hellbender.Main.instanceMain(Main.java:96)
	at org.broadinstitute.hellbender.Main.instanceMain(Main.java:103)
	at org.broadinstitute.hellbender.Main.mainEntry(Main.java:116)
	at org.broadinstitute.hellbender.Main.main(Main.java:158)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:736)
	at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:185)
	at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:210)
	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:124)
	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.io.IOException: Existing mirrorFile and resourceId don't match isDirectory status! '/hadoop_gcs_connector_metadata_cache/hellbender/test/output/gatk4-spark/recalibrated.bam' (dir: 'false') vs 'gs://hellbender/test/output/gatk4-spark/recalibrated.bam/' (dir: 'true')
	at com.google.cloud.hadoop.gcsio.FileSystemBackedDirectoryListCache.getCacheEntryInternal(FileSystemBackedDirectoryListCache.java:198)
	at com.google.cloud.hadoop.gcsio.FileSystemBackedDirectoryListCache.putResourceId(FileSystemBackedDirectoryListCache.java:363)
	at com.google.cloud.hadoop.gcsio.CacheSupplementedGoogleCloudStorage.createEmptyObjects(CacheSupplementedGoogleCloudStorage.java:150)
	at com.google.cloud.hadoop.gcsio.GoogleCloudStorageFileSystem.mkdirs(GoogleCloudStorageFileSystem.java:578)
	at com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase.mkdirs(GoogleHadoopFileSystemBase.java:1372)
	at org.apache.hadoop.fs.FileSystem.mkdirs(FileSystem.java:1881)
	at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.setupJob(FileOutputCommitter.java:313)
	at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply$mcV$sp(PairRDDFunctions.scala:1150)
	at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply(PairRDDFunctions.scala:1078)
	at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply(PairRDDFunctions.scala:1078)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
	at org.apache.spark.rdd.RDD.withScope(RDD.scala:358)
	at org.apache.spark.rdd.PairRDDFunctions.saveAsNewAPIHadoopDataset(PairRDDFunctions.scala:1078)
	at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopFile$2.apply$mcV$sp(PairRDDFunctions.scala:998)
	at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopFile$2.apply(PairRDDFunctions.scala:989)
	at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopFile$2.apply(PairRDDFunctions.scala:989)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
	at org.apache.spark.rdd.RDD.withScope(RDD.scala:358)
	at org.apache.spark.rdd.PairRDDFunctions.saveAsNewAPIHadoopFile(PairRDDFunctions.scala:989)
	at org.apache.spark.api.java.JavaPairRDD.saveAsNewAPIHadoopFile(JavaPairRDD.scala:811)
	at org.broadinstitute.hellbender.engine.spark.datasources.ReadsSparkSink.saveAsShardedHadoopFiles(ReadsSparkSink.java:216)
	at org.broadinstitute.hellbender.engine.spark.datasources.ReadsSparkSink.writeReadsSingle(ReadsSparkSink.java:242)
	at org.broadinstitute.hellbender.engine.spark.datasources.ReadsSparkSink.writeReads(ReadsSparkSink.java:166)
	at org.broadinstitute.hellbender.engine.spark.GATKSparkTool.writeReads(GATKSparkTool.java:248)

for the same input.

@lbergelson (Member):

Possibly related to these warnings:

17/01/09 21:57:53 INFO com.google.cloud.genomics.dataflow.readers.bam.BAMIO: getReadsFromBAMFile - got input resources
17/01/09 21:57:54 INFO org.apache.hadoop.mapreduce.lib.input.FileInputFormat: Total input paths to process : 1
17/01/09 22:01:44 WARN com.google.cloud.hadoop.gcsio.FileSystemBackedDirectoryListCache: Got null fileList for listBaseFile '/hadoop_gcs_connector_metadata_cache/hellbender/test/output/gatk4-spark/recalibrated.bam' even though exists() was true!
17/01/09 22:01:45 INFO com.google.cloud.hadoop.gcsio.CacheSupplementedGoogleCloudStorage: Populating missing itemInfo on-demand for entry: gs://hellbender/test/output/gatk4-spark/recalibrated.bam
17/01/09 22:01:45 WARN com.google.cloud.hadoop.gcsio.CacheSupplementedGoogleCloudStorage: Possible stale CacheEntry; failed to fetch item info for: gs://hellbender/test/output/gatk4-spark/recalibrated.bam - removing from cache

@lbergelson (Member):

and sometimes I get:

java.net.SocketTimeoutException: Read timed out
	at java.net.SocketInputStream.socketRead0(Native Method)
	at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
	at java.net.SocketInputStream.read(SocketInputStream.java:170)
	at java.net.SocketInputStream.read(SocketInputStream.java:141)
	at sun.security.ssl.InputRecord.readFully(InputRecord.java:465)
	at sun.security.ssl.InputRecord.read(InputRecord.java:503)
	at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:973)
	at sun.security.ssl.SSLSocketImpl.readDataRecord(SSLSocketImpl.java:930)
	at sun.security.ssl.AppInputStream.read(AppInputStream.java:105)
	at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
	at java.io.BufferedInputStream.read1(BufferedInputStream.java:286)
	at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
	at sun.net.www.http.HttpClient.parseHTTPHeader(HttpClient.java:704)
	at sun.net.www.http.HttpClient.parseHTTP(HttpClient.java:647)
	at sun.net.www.protocol.http.HttpURLConnection.getInputStream0(HttpURLConnection.java:1569)
	at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1474)
	at java.net.HttpURLConnection.getResponseCode(HttpURLConnection.java:480)
	at sun.net.www.protocol.https.HttpsURLConnectionImpl.getResponseCode(HttpsURLConnectionImpl.java:3

@droazen droazen requested a review from lbergelson January 10, 2017 20:19
@droazen droazen assigned lbergelson and unassigned droazen Jan 10, 2017
@droazen (Contributor) commented Jan 10, 2017

It's looking like we might have to fix the issues with NIO here after all @tomwhite @jean-philippe-martin, as @lbergelson has been unable to get this working reasonably with the GCS adapter (it runs, but veeeerrryyy slowly).

@droazen (Contributor) commented Jan 17, 2017

Closing in favor of an NIO-based fix.

@droazen droazen closed this Jan 17, 2017