Possible regression in merging bams on GCS in Spark #2287
One possible culprit is HadoopGenomics/Hadoop-BAM#111, which introduced the NIO-based file merging. @tomwhite @jean-philippe-martin Could you comment with your thoughts on this? |
Sorry, I won't be able to get to this until next week at the earliest. |
I'm not sure why the … However, since the Hadoop-BAM code is doing the path lookups, it can't call that code directly (the dependency is the wrong way round). It would be best if we could fix the underlying problem, of course, so that the provider gets picked up properly. I wonder if this can be done by fixing the service provider file so that it survives relocation (see http://maven.apache.org/plugins/maven-shade-plugin/examples/resource-transformers.html#ServicesResourceTransformer). BTW I'm afraid I'm travelling this week, so I won't have time to look at it until next week either :( |
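For reference, the transformer linked above is enabled in the shade plugin configuration roughly like this (a sketch; plugin version and the rest of the pom are omitted). `ServicesResourceTransformer` merges `META-INF/services` files from the merged jars and rewrites relocated provider class names so `ServiceLoader` lookups still resolve after shading:

```xml
<build>
  <plugins>
    <plugin>
      <groupId>org.apache.maven.plugins</groupId>
      <artifactId>maven-shade-plugin</artifactId>
      <configuration>
        <transformers>
          <!-- Merge META-INF/services entries and apply class relocations to them -->
          <transformer implementation="org.apache.maven.plugins.shade.resource.ServicesResourceTransformer"/>
        </transformers>
      </configuration>
    </plugin>
  </plugins>
</build>
```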
There's no rush on my end, FYI. |
Next week is fine guys, no worries! :) Does the GS NIO provider currently support writing, @jean-philippe-martin? This particular method is trying to write a merged bam file to a bucket. I'm wondering if it would work even if the … |
@droazen yes, the GS NIO provider supports writing (though I've spent most of my focus on the reading side). You have to pick reading or writing when you open the file, you cannot do both on the same stream at the same time (I don't think that's the problem here). What might be going on here is confusion between the NIO paths (they start with "gs://") and the GCS connector paths (they also start with "gs://"). Last time I checked the NIO paths were passed via a different input argument so we didn't confuse the two, but it's possible maybe some wires got crossed inside? |
I checked in the Spark JAR, and … So it looks like the service loader file is being created correctly. I tried to reproduce on GCS by running … (essentially the same as @vdauwera's command). But I got another error earlier on. Any ideas what this could be? (I can see the input bam with … |
@droazen helped me out with this. It was because I wasn't specifying the API key correctly. |
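As an aside for anyone hitting a similar auth error: one general way to hand credentials to the Google client libraries is Application Default Credentials via a service-account key file. This is a standard GCP mechanism, not necessarily the exact fix used here, and the path below is an invented example:

```shell
# Point Google client libraries at a service-account JSON key file.
# The path is an invented example; any readable key file works.
export GOOGLE_APPLICATION_CREDENTIALS="$HOME/keys/example-project-svc.json"
echo "credentials file set to: $GOOGLE_APPLICATION_CREDENTIALS"
```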
Spark hiding the GCS-NIO jar is a problem we indeed ran into before (see IOUtils.getPath). Given that this never worked, I wonder how we were outputting merged bams to GCS before? |
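The symptom of a hidden provider is easy to demonstrate: `java.nio.file.Paths.get(URI)` throws `FileSystemNotFoundException` when no `FileSystemProvider` for the URI's scheme is installed, which is exactly what happens when shading or classpath isolation drops google-cloud-nio's `META-INF/services` entry. A minimal sketch (bucket and object names are invented):

```java
import java.net.URI;
import java.nio.file.FileSystemNotFoundException;
import java.nio.file.Path;
import java.nio.file.Paths;

public class GsProviderCheck {
    /**
     * Tries to resolve a gs:// URI through java.nio. Returns the Path's string
     * form if a FileSystemProvider for "gs" is installed, or null if none is
     * visible (the symptom seen when the google-cloud-nio service entry is
     * hidden from the application classloader).
     */
    static String tryResolve(String uri) {
        try {
            Path p = Paths.get(URI.create(uri));
            return p.toString();
        } catch (FileSystemNotFoundException e) {
            return null; // no provider for this scheme on the classpath
        }
    }

    public static void main(String[] args) {
        String resolved = tryResolve("gs://my-bucket/reads.bam");
        System.out.println(resolved == null
                ? "no NIO provider registered for scheme gs"
                : "resolved to " + resolved);
    }
}
```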
@jean-philippe-martin Possibly we were using the GCS<->HDFS adapter previously, and something changed in the code to make us use NIO here instead? (possibly HadoopGenomics/Hadoop-BAM#111?) |
I concur, what it looks like we have here is code that switched (accidentally?) from the GCS adapter to GCS-NIO instead. |
Yes, Hadoop-BAM uses the NIO API to do file merging, whereas in GATK we were using the Hadoop APIs (and therefore the GCS<->HDFS adapter) to do it. It looks like there are a couple of things needed in GCS-NIO before the NIO API can be used for this:
1. googleapis/google-cloud-java#1450, so that we don't have to special-case gs URIs to remove everything except the scheme and host when looking up the filesystem (see https://github.com/HadoopGenomics/Hadoop-BAM/blob/master/src/main/java/org/seqdoop/hadoop_bam/util/NIOFileUtil.java#L40).
2. googleapis/google-cloud-java#813, to support path matching (see https://github.com/HadoopGenomics/Hadoop-BAM/blob/master/src/main/java/org/seqdoop/hadoop_bam/util/NIOFileUtil.java#L90).
There may be more, as I stopped there. The best way forward is probably to go back to the old code in GATK while the deficiencies in GCS-NIO are fixed and then released. The stacktrace I got for 1 was: … And for 2: … The command I ran was: … |
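The gs URI special-casing being discussed (trimming a full object URI down to its scheme and host before the filesystem lookup) amounts to a few lines of URI surgery. A sketch of that idea; the helper name here is made up, not Hadoop-BAM's actual method:

```java
import java.net.URI;
import java.net.URISyntaxException;

public class GsUriRoot {
    /**
     * Reduce a full object URI such as gs://bucket/path/to/obj to its
     * bucket root gs://bucket/ for a FileSystems lookup. This mirrors the
     * kind of special-casing done in Hadoop-BAM's NIOFileUtil.
     */
    static URI bucketRoot(URI uri) {
        try {
            // Keep only scheme and authority (the bucket); drop the object path.
            return new URI(uri.getScheme(), uri.getAuthority(), "/", null, null);
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException(uri.toString(), e);
        }
    }

    public static void main(String[] args) {
        System.out.println(bucketRoot(URI.create("gs://my-bucket/reads/sample.bam")));
        // prints gs://my-bucket/
    }
}
```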
It would be helpful to have a failing unit test for this that runs on files in GCS. |
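A local-filesystem stand-in for part of such a test is straightforward with NIO's directory-stream glob support, the same mechanism a merger uses to collect Spark/Hadoop shard files before concatenating them. File names below are invented for illustration:

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

public class PartGlob {
    /**
     * List the paths under dir matching a glob such as "part-r-*", sorted by
     * name, mimicking how a merger collects shard files in order.
     */
    static List<Path> matchParts(Path dir, String glob) {
        List<Path> parts = new ArrayList<>();
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(dir, glob)) {
            for (Path p : stream) parts.add(p);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        parts.sort(null); // natural (lexicographic) path order
        return parts;
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("merge-demo");
        Files.createFile(dir.resolve("part-r-00000"));
        Files.createFile(dir.resolve("part-r-00001"));
        Files.createFile(dir.resolve("_SUCCESS"));
        // Matches the two part files; _SUCCESS is excluded by the glob.
        System.out.println(matchParts(dir, "part-r-*").size());
    }
}
```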
@tomwhite Can we get a quick fix here while we wait for the NIO improvements you suggested? Perhaps we could create a separate copy of the bam merger in GATK4 that uses the old code path, and use that copy for now instead of Hadoop-BAM's? |
@droazen That's probably the best way to do it, although I don't think we can be sure of a quick fix until we have a failing unit test for this on GCS. The turnaround time for manual testing is slow (for me at the moment). If someone could write such a test that would be very helpful. |
@tomwhite If you could put together a quick PR against GATK4 with a temporary copy of the bam merger that uses the old code path, I'd be happy to test it. |
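At its core, a merger along the old code path concatenates the sorted shard files into one output. Sketched here with plain java.nio streams rather than the Hadoop FileSystem API; the real BAM merge also writes a header first and a BGZF terminator block last, both omitted for brevity:

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

public class ConcatMerge {
    /**
     * Append each part file to dest, in list order. A real BAM merge would
     * also prepend a header and append the BGZF terminator; this sketch only
     * shows the byte-level concatenation step.
     */
    static void merge(List<Path> parts, Path dest) throws IOException {
        try (OutputStream out = Files.newOutputStream(dest)) {
            for (Path part : parts) {
                Files.copy(part, out);
            }
        }
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("concat-demo");
        Path a = Files.write(dir.resolve("part-r-00000"), "AB".getBytes());
        Path b = Files.write(dir.resolve("part-r-00001"), "CD".getBytes());
        Path out = dir.resolve("merged");
        merge(List.of(a, b), out);
        System.out.println(new String(Files.readAllBytes(out))); // prints ABCD
    }
}
```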
I have pull requests in flight for both (1) and (2). They are googleapis/google-cloud-java#1469 and googleapis/google-cloud-java#1470.
Cheers,
JP
…On Tue, Dec 6, 2016 at 3:54 AM, Tom White wrote:
Yes, Hadoop-BAM uses the NIO API to do file merging, whereas in GATK we were using the Hadoop APIs (and therefore the GCS<->HDFS adapter) to do it. It looks like there are a couple of things needed in GCS-NIO to use the NIO API for this:
1. googleapis/google-cloud-java#1450, so that we don't have to special-case gs URIs to remove everything except the scheme and host when looking up the filesystem (see https://github.com/HadoopGenomics/Hadoop-BAM/blob/master/src/main/java/org/seqdoop/hadoop_bam/util/NIOFileUtil.java#L40).
2. googleapis/google-cloud-java#813, to support path matching (see https://github.com/HadoopGenomics/Hadoop-BAM/blob/master/src/main/java/org/seqdoop/hadoop_bam/util/NIOFileUtil.java#L90).
There may be more, as I stopped there. The best way forward is probably to go back to the old code in GATK while the deficiencies in GCS-NIO are fixed and then released. |
@jean-philippe-martin thanks for fixing the two GCS problems. I've filed another issue to do with walking the filesystem here: googleapis/google-cloud-java#1519. /cc @droazen @lbergelson |
@tomwhite @jean-philippe-martin Is this one done? Can we close this? |
Yes |
@vdauwera reports that the following ApplyBQSRSpark command fails with an NIO-related error on dataproc: …