Malformed VCF in distributed mode #353

Closed
fnothaft opened this issue Aug 12, 2014 · 18 comments

@fnothaft
Member

This bug was reported by @ryan-williams and is cross-documented at hammerlab/guacamole#116 and on the ADAM dev list. The long and short of it is that when writing a VCF in distributed mode, the setHeader function isn't called on the VCF output format, so the header doesn't get set on each node. I believe this should impact both VCF and SAM/BAM output. I believe we can fix this with a mapPartitions call (or similar) that invokes setHeader on each node before the file is written. This is a bit of a hack, but it should resolve the problem.
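
For concreteness, here's a minimal sketch of that idea, not ADAM's actual code: VcfOutputFormatStandIn, its String header, and the primeHeader helper are hypothetical stand-ins for hadoop-bam's real output format and VCFHeader type.

```scala
import scala.reflect.ClassTag
import org.apache.spark.rdd.RDD

// Stand-in for the hadoop-bam VCF output format: the header lives in per-JVM
// state that the record writer reads when it starts emitting records.
object VcfOutputFormatStandIn {
  @volatile var header: Option[String] = None
  def setHeader(h: String): Unit = { header = Some(h) }
}

object VcfWriteWorkaround {
  // Chain this in front of the save call so the header gets set inside every
  // executor JVM, not just on the driver.
  def primeHeader[T: ClassTag](records: RDD[T], headerText: String): RDD[T] =
    records.mapPartitions({ iter =>
      VcfOutputFormatStandIn.setHeader(headerText)
      iter
    }, preservesPartitioning = true)
}
```

The point is just that mapPartitions runs on the executors, so whatever JVM-local state the record writer consults at write time has been populated on each node before the write happens.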

@ryan-williams
Member

Hey @fnothaft, thanks a lot for this. I had some trouble getting a run done on my cluster with a jar built with this patch (for reasons unrelated to this issue), but I'm pretty sure I just successfully ran it and unfortunately still see the issue in the output VCF.

I'll follow up with some more information shortly, and try to run again to double-check, but if you have ideas for reasons this patch might not fix it as it currently stands, let me know as well.

@fnothaft
Member Author

@ryan-williams ah, OK. Not so good to hear! Please let me know if you've got any additional data you can share to help reproduce this. I'll be out this AM, but will spin up an EC2 cluster this afternoon and debug further.

@ryan-williams
Member

Here's a gist with 3 files generated from a run I did with some printlns sprinkled through ADAM and some of its deps, as well as with your patch cherry-picked in (code here, minus printlns deeper in ADAM's dep closure). The 3 files in the gist are:

  1. console stdout from my spark client
  2. the resultant (still malformed) output VCF
  3. stdout/stderr from my 2 yarn executors that had any stdout

Things to note:

  • in the stdout from 1 and 3, look for VCFHeader.toString() outputs followed by a string like ", samples: ???", which indicates that the samples array has been lost (empty / not initialized) at that point. That happens in the spark client console, as well as in the second of my two containers, container_1403901413406_1104_01_000023, but not in container_1403901413406_1104_01_000156, which has ", samples: somatic". Unfortunately, the first container, which also has printlns showing your mapPartitions call executing, doesn't seem to be the one writing my VCF, and I'm not sure why the other one never runs your mapPartitions code.

Again, my branch with your commit and my printlns is here (not pictured: printlns I plumbed down into hadoop-bam-6.2, variant-1.107, and spark-core-1.0.0).

None of this is likely directly reproducible on your end, but let me know if you have leads for following up. It's also possible that I'm doing something incorrect on my end, so I'll keep trying to get to the bottom of things.

@fnothaft
Member Author

Thanks @ryan-williams! I'll dig back into these details now.

@fnothaft
Member Author

@ryan-williams I've just pushed a minor update that adds a broadcast to ensure that the samples data is being serialized and sent out to the nodes. I've tested this on EC2 and it seems to work, but that was the same last time, so I'm hoping to test it more rigorously this weekend.
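
For reference, here's a hedged sketch of what the "add a broadcast" change looks like in shape; HeaderState and primeSamples are illustrative names, not ADAM's actual API.

```scala
import scala.reflect.ClassTag
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

// Stand-in for whatever per-JVM state the VCF output format reads at write time.
object HeaderState {
  @volatile var samples: Seq[String] = Nil
}

object BroadcastSamples {
  def primeSamples[T: ClassTag](sc: SparkContext,
                                records: RDD[T],
                                samples: Seq[String]): RDD[T] = {
    // Broadcasting serializes the sample list once on the driver and ships it to
    // every executor, rather than relying on state that only exists driver-side.
    val bcast = sc.broadcast(samples)
    records.mapPartitions({ iter =>
      HeaderState.samples = bcast.value
      iter
    }, preservesPartitioning = true)
  }
}
```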

danvk added a commit to hammerlab/igv-httpfs that referenced this issue Sep 10, 2014
@ryan-williams
Member

belated thanks for your work here, @fnothaft.

@arahuja mentioned yesterday that he'd used a more recent version of ADAM and gotten proper VCFs thanks to this fix, so I went back to try to figure out what went wrong with my attempt to test it last month.

I ended up at an impasse where, at the SHA of the original fix (fd3c6d5), I still see the malformed-ness cropping up nondeterministically.

I now see your note about a follow-on commit that ensures the samples data gets broadcast; which SHA was that? 8672534?

@fnothaft
Member Author

@ryan-williams no sweat! Are you trying to reproduce it with the original commit, or with the current fnothaft@fd3c6d5? I had added the broadcast fix in an intermediate commit that I squashed down before merging, so it should be present in fd3c6d5, if you're picking from the latest.

@ryan-williams
Member

Hm, that is ominous. I am running off of the real fd3c6d5 here.

I'm trying to repro a malformed VCF from a commit further forward than fd3c6d5, though I'm not sure what that would tell us at this point given the new info you've shared here.

Is there anything off the top of your head that might make the fix race-y?

@fnothaft
Member Author

Does the issue reproduce if you take f50af87 as well? I've done my testing with the two commits put together. I wouldn't expect it to hit a race condition; there's one specific race condition that I can imagine but I believe we're protected against that already. I can put together a snippet of code that would definitely prevent that race from occurring, but I think we're already guarded.

@ryan-williams
Member

to clarify, f50af87 is just fd3c6d5 with 8672534 added on, right?

I'm testing 8672534 now and so far have been unable to repro the bug; I'll keep at it. I'm curious about any further thoughts you have on why 8672534 would actually be immune to a race that sometimes affects fd3c6d5.

@fnothaft
Member Author

@ryan-williams sorry; I meant 8672534, but yes. I wouldn't expect there to be a race difference between the two, but let me take a look over the code again this PM, and I'll see if I can sort it out!

@ryan-williams
Member

Yea, the only thing in 8672534 that seems related is the hadoop version upgrade... could that have done anything?

@fnothaft
Member Author

The Hadoop version "upgrade" ;) happened because I accidentally checked fd3c6d5 in with the Hadoop version we use on EC2 and then fixed it in 8672534... It's possible that it's related, though; trying fd3c6d5 with the correct Hadoop version would be worthwhile.

@fnothaft
Member Author

It's also possible, though, that it's the addition of caching in adam-core/src/main/scala/org/bdgenomics/adam/rdd/variation/ADAMVariationContext.scala in 8672534.
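
Purely as a sketch of why that could matter (speculative, and the names below are illustrative rather than the real code path): with cache() in place, later actions reuse the materialized partitions instead of recomputing the lineage, which could change which executors end up having run the header-priming pass.

```scala
import org.apache.spark.rdd.RDD

object CachingNote {
  def actTwice[T](primed: RDD[T]): Unit = {
    val cached = primed.cache() // pin the partitions after the header-priming pass
    cached.count()              // first action materializes (and caches) them
    cached.count()              // later actions reuse the cached partitions;
                                // without cache(), each action would recompute the
                                // lineage from scratch, possibly on executors that
                                // never ran the mapPartitions header setup
  }
}
```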

@ryan-williams
Member

ah, yea, that also looked potentially relevant.

For completeness, here are a few words about my testing methodology: I have a test-fix script that:

  1. runs mvn clean package -DskipTests in ADAM
  2. injects adam-core and adam-cli JARs into my local maven repository
  3. moves over to guacamole
  4. verifies that the adam.version in guacamole matches the version of ADAM that was built
  5. runs mvn clean package -DskipTests in guacamole, which creates a guacamole JAR that includes the custom ADAM JARs from 2.
  6. runs that guacamole JAR on our cluster on some sample data and checks for the "somatic" column header, whose presence or absence has been necessary and sufficient to determine whether our output VCFs are malformed.

Here are my most recent data points from running ./test-fix at different, relevant ADAM SHAs:

  • a few runs on master (8fa1199): good
  • attempt to git bisect between master and fd3c6d5 (having observed the latter failing previously):
    • 1 run at d663aab: good
    • 1 run at f50af87: good
    • 1 run at 8672534: good
  • at this point bisect has pointed me at 8672534 as the fix, but I suspect there's something more going on, as I'd expected fd3c6d5 to work, and I thought I'd seen it work previously.
  • 1 run at fd3c6d5: good!
  • 7 more runs at fd3c6d5: bad. Maybe fd3c6d5 is bad?
  • 3 runs at 8672534: good
  • 4 runs at fd3c6d5: good, bad, bad, good
  • 4 runs at 8672534: good

@ryan-williams
Member

So I am seeing fd3c6d5 waffling, and 8672534 always passing.

@ryan-williams
Member

Anyway, this is fixed; thanks again @fnothaft. I don't have the bandwidth to figure out what might have been race-y between fd3c6d5 and 8672534, but that's a different problem.

@fnothaft
Member Author

@ryan-williams sorry, this dropped off my radar as well. Glad it is good on your end though.
