-
Notifications
You must be signed in to change notification settings - Fork 311
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Malformed VCF in distributed mode #353
Comments
Hey @fnothaft, thanks a lot for this. I had some trouble getting a run done on my cluster with a jar built with this patch (for reasons unrelated to this issue), but I'm pretty sure I just successfully ran it and unfortunately still see the issue in the output VCF. I'll follow up with some more information shortly, and try to run again to double-check, but if you have ideas for reasons this patch might not fix it as it currently stands, let me know as well. |
@ryan-williams ah, OK. Not so good to hear! Please let me know if you've got any additional data you can share to help reproduce this. I'll be out this AM, but will spin up an EC2 cluster this afternoon and debug further. |
Here's a gist with 3 files generated from a run I did with some printlns sprinkled through ADAM and some of its deps, as well as with your patch cherry-picked in (code here, minus printlns deeper in ADAM's dep closure). The 3 files in the gist are:
Things to note:
Again, my branch with your commit and my printlns is here (not pictured: printlns I plumbed down into None of this is likely directly reproducible on your end, but let me know if you have leads for following up. It's also possible that I'm doing something incorrect on my end so I will keep trying to get to the bottom of things. |
Thanks @ryan-williams! I'll dig back into these details now. |
@ryan-williams I've just pushed a minor update that adds a broadcast to ensure that the samples data is being serialized and sent out to the nodes. I've tested this on EC2 and it seems to work, but that was the same last time, so I'm hoping to test it more rigorously this weekend. |
This is a complete hack until bigdatagenomics/adam#353 is fixed.
belated thanks for your work here, @fnothaft. @arahuja mentioned yesterday that he'd used a more recent version of ADAM and gotten proper VCFs thanks to this fix, so I went back to try to figure out what went wrong with my attempt to test it last month. I ended up at an impasse where, at the SHA of the original fix ( I now see your note about a follow-on commit that ensures that the samples data get broadcast; which SHA was that? |
@ryan-williams no sweat! Are you trying to reproduce it with the original commit, or with the current fnothaft@fd3c6d5? I had added the broadcast fix in an intermediate commit that I squashed down before merging, so it should be present in fd3c6d5, if you're picking from the latest. |
hm, that is ominous. I am running off of the real trying to repro a malformed VCF from a commit further forward than anything off the top of your head that might make the fix race-y? |
Does the issue reproduce if you take f50af87 as well? I've done my testing with the two commits put together. I wouldn't expect it to hit a race condition; there's one specific race condition that I can imagine but I believe we're protected against that already. I can put together a snippet of code that would definitely prevent that race from occurring, but I think we're already guarded. |
@ryan-williams sorry; I meant 8672534, but yes. I wouldn't expect there to be a race difference between the two, but let me take a look over the code again this PM, and I'll see if I can sort it out! |
Yea, the only thing in |
It's also possible though that it could be the addition of caching in adam-core/src/main/scala/org/bdgenomics/adam/rdd/variation/ADAMVariationContext.scala in 8672534. |
ah, yea, that also looked potentially relevant. For completeness, here's a few words about my testing methodology: I have a
Here are my most recent data points from running
|
So I am seeing |
@ryan-williams sorry, this dropped off my radar as well. Glad it is good on your end though. |
This bug was reported by @ryan-williams and is cross-documented at hammerlab/guacamole#116 and on the ADAM dev-list. The long/short of it is, when writing a VCF while distributed, the setHeader function isn't called on the VCF output format, so the header doesn't get set on each node. I believe this should impact both VCF and SAM/BAM output. I believe we can fix this by doing a mapPartitions call or similar before writing the file that calls the setHeader function. This is a bit of a hack, but should resolve the problem.
The text was updated successfully, but these errors were encountered: