-
Notifications
You must be signed in to change notification settings - Fork 597
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Revert use of NIO-based SAMFileMerger from Hadoop-BAM #2306
Conversation
This reverts commit e98fe09.
final String outputParentDir = outputFile.substring(0, outputFile.lastIndexOf('/') + 1); | ||
// First, check for the _SUCCESS file. | ||
final String successFile = outputFile + "/_SUCCESS"; | ||
final Path successPath = new Path(successFile); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps we should use a fully-qualified class name when using Hadoop's Path (org.apache.hadoop.fs.Path
), to avoid potential confusion with NIO Path?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm just reverting the previous change here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, fair enough
I ran the test case from #2287 using this branch and I get
|
Re-opening this one so that we have a place to discuss what needs to be done to get this working. |
Current coverage is 75.760% (diff: 89.744%)@@ master #2306 diff @@
==========================================
Files 728 729 +1
Lines 38451 38622 +171
Methods 0 0
Messages 0 0
Branches 8027 8073 +46
==========================================
+ Hits 29108 29260 +152
- Misses 6840 6847 +7
- Partials 2503 2515 +12
|
@tomwhite Ack, sorry, meant to just comment, not close and comment. |
@lbergelson that seems to be a separate bug, since this just reverts some commits. There's obviously a simple workaround here too. |
How should we proceed? Should we try to add a new commit on this branch to fix the issue Louis ran into? |
This was meant to be a quick fix, so I would commit this - and the overwriting issue can be looked at later. |
@lbergelson do you agree? |
I'm fine with merging this. Is the workaround for now that you have to run with a directory instead of a bam file path? It looks like it doesn't fix Geraldine's use case because of the silly overwriting issue. |
@tomwhite I'm curious if the reason this is working for you but not for me is because you're testing with a file that fits into a single partition. |
My mistake. I thought it was because of a file or directory from a previous run. If not, then it could be a difference in the way that GCS handles overwriting files. |
@tomwhite Ah, I see the confusion. I made sure the file didn't exist before running. I think it's because we write the parts files into a temporary directory with the same name as the ultimate output, and then copy the final combined file over as a file with the same name. (unless I'm misrembering how it works. ) |
@lbergelson So it looks like this never worked on GCS? At this point it might be best to get the code in Hadoop-BAM working with GCS (since that's what we'd prefer to use long-term), rather than patching the code being reinstated by this PR to work with GCS. |
At some point we could write files to gs:// addresses though. I'm not sure what's going on that makes this not work when it somehow did work in the past. |
Really we need some tests for gs:// files in ReadsSparkSinkUnitTest - e.g. a GCS version of testWritingToFileURL. This needs knowledge of how to configure the Hadoop GCS connector (outside dataproc), which I lack. Perhaps someone else knows how to do this? |
Interesting, also seeing:
for the same input. |
possibly relate to this warning:
|
and sometimes I get:
|
It's looking like we might have to fix the issues with NIO here after all @tomwhite @jean-philippe-martin, as @lbergelson has been unable to get this working reasonably with the GCS adapter (it runs, but veeeerrryyy slowly). |
Closing in favor of an NIO-based fix. |
See #2287