Malformed VCF in distributed mode #353

Closed
fnothaft opened this issue Aug 12, 2014 · 18 comments

@fnothaft
Member

This bug was reported by @ryan-williams and is cross-documented at hammerlab/guacamole#116 and on the ADAM dev list. The long and short of it is that when writing a VCF in distributed mode, the setHeader function isn't called on the VCF output format, so the header doesn't get set on each node. I believe this should impact both VCF and SAM/BAM output. I believe we can fix this with a mapPartitions call (or similar) that invokes setHeader on each node before the file is written. This is a bit of a hack, but it should resolve the problem.
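
For concreteness, here's a minimal sketch of that idea, not ADAM's actual code: VcfOutputFormatStandIn, its String header, and the primeHeader helper are hypothetical stand-ins for hadoop-bam's real output format and VCFHeader type.

```scala
import scala.reflect.ClassTag
import org.apache.spark.rdd.RDD

// Stand-in for the hadoop-bam VCF output format: the header lives in per-JVM
// state that the record writer reads when it starts emitting records.
object VcfOutputFormatStandIn {
  @volatile var header: Option[String] = None
  def setHeader(h: String): Unit = { header = Some(h) }
}

object VcfWriteWorkaround {
  // Chain this in front of the save call so the header gets set inside every
  // executor JVM, not just on the driver.
  def primeHeader[T: ClassTag](records: RDD[T], headerText: String): RDD[T] =
    records.mapPartitions({ iter =>
      VcfOutputFormatStandIn.setHeader(headerText)
      iter
    }, preservesPartitioning = true)
}
```

The point is just that mapPartitions runs on the executors, so whatever JVM-local state the record writer consults at write time has been populated on each node before the write happens.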

@ryan-williams
Member

Hey @fnothaft, thanks a lot for this. I had some trouble getting a run done on my cluster with a jar built with this patch (for reasons unrelated to this issue), but I'm pretty sure I just successfully ran it and unfortunately still see the issue in the output VCF.

I'll follow up with some more information shortly, and try to run again to double-check, but if you have ideas for reasons this patch might not fix it as it currently stands, let me know as well.

@fnothaft
Member Author

@ryan-williams ah, OK. Not so good to hear! Please let me know if you've got any additional data you can share to help reproduce this. I'll be out this AM, but will spin up an EC2 cluster this afternoon and debug further.

@ryan-williams
Member

Here's a gist with 3 files generated from a run I did with some printlns sprinkled through ADAM and some of its deps, as well as with your patch cherry-picked in (code here, minus printlns deeper in ADAM's dep closure). The 3 files in the gist are:

  1. console stdout from my spark client
  2. the resultant (still malformed) output VCF
  3. stdout/stderr from my 2 yarn executors that had any stdout

Things to note:

  • in the stdout from 1 and 3, look for VCFHeader.toString() outputs followed by a string like ", samples: ???", which indicates that the samples array has been lost (empty / not initialized) at that point. That happens in the spark client console, as well as in the second of my two containers, container_1403901413406_1104_01_000023, but not in container_1403901413406_1104_01_000156, which has ", samples: somatic". Unfortunately, the first container, which also has printlns showing your mapPartitions call executing, doesn't seem to be the one writing my VCF, and I'm not sure why the other one never runs your mapPartitions code.

Again, my branch with your commit and my printlns is here (not pictured: printlns I plumbed down into hadoop-bam-6.2, variant-1.107, and spark-core-1.0.0).

None of this is likely directly reproducible on your end, but let me know if you have leads for following up. It's also possible that I'm doing something incorrect on my end, so I'll keep trying to get to the bottom of things.

@fnothaft
Member Author

Thanks @ryan-williams! I'll dig back into these details now.

@fnothaft
Member Author

@ryan-williams I've just pushed a minor update that adds a broadcast to ensure that the samples data is being serialized and sent out to the nodes. I've tested this on EC2 and it seems to work, but that was the same last time, so I'm hoping to test it more rigorously this weekend.
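
For reference, here's a hedged sketch of what the "add a broadcast" change looks like in shape; HeaderState and primeSamples are illustrative names, not ADAM's actual API.

```scala
import scala.reflect.ClassTag
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

// Stand-in for whatever per-JVM state the VCF output format reads at write time.
object HeaderState {
  @volatile var samples: Seq[String] = Nil
}

object BroadcastSamples {
  def primeSamples[T: ClassTag](sc: SparkContext,
                                records: RDD[T],
                                samples: Seq[String]): RDD[T] = {
    // Broadcasting serializes the sample list once on the driver and ships it to
    // every executor, rather than relying on state that only exists driver-side.
    val bcast = sc.broadcast(samples)
    records.mapPartitions({ iter =>
      HeaderState.samples = bcast.value
      iter
    }, preservesPartitioning = true)
  }
}
```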

danvk added a commit to hammerlab/igv-httpfs that referenced this issue Sep 10, 2014
@ryan-williams
Member

belated thanks for your work here, @fnothaft.

@arahuja mentioned yesterday that he'd used a more recent version of ADAM and gotten proper VCFs thanks to this fix, so I went back to try to figure out what went wrong with my attempt to test it last month.

I ended up at an impasse where, at the SHA of the original fix (fd3c6d5), I still see the malformed-ness cropping up nondeterministically.

I now see your note about a follow-on commit that ensures the samples data gets broadcast; which SHA was that? 8672534?

@fnothaft
Member Author

@ryan-williams no sweat! Are you trying to reproduce it with the original commit, or with the current fnothaft@fd3c6d5? I had added the broadcast fix in an intermediate commit that I squashed down before merging, so it should be present in fd3c6d5, if you're picking from the latest.

@ryan-williams
Member

Hm, that is ominous. I am running off of the real fd3c6d5 here.

I'm trying to repro a malformed VCF from a commit further forward than fd3c6d5, though I'm not sure what that would tell us at this point given the new info you've shared here.

Is there anything off the top of your head that might make the fix race-y?

@fnothaft
Member Author

Does the issue reproduce if you take f50af87 as well? I've done my testing with the two commits put together. I wouldn't expect it to hit a race condition; there's one specific race condition that I can imagine but I believe we're protected against that already. I can put together a snippet of code that would definitely prevent that race from occurring, but I think we're already guarded.

@ryan-williams
Member

to clarify, f50af87 is just fd3c6d5 with 8672534 added on, right?

I'm testing 8672534 now and so far have been unable to repro the bug; I'll keep at it. I'm curious about any further thoughts you have on why 8672534 would actually be immune to a race that sometimes affects fd3c6d5.

@fnothaft
Member Author

@ryan-williams sorry; I meant 8672534, but yes. I wouldn't expect there to be a race difference between the two, but let me take a look over the code again this PM, and I'll see if I can sort it out!

@ryan-williams
Member

Yea, the only thing in 8672534 that seems related is the hadoop version upgrade... could that have done anything?

@fnothaft
Member Author

The Hadoop version "upgrade" ;) happened because I accidentally checked fd3c6d5 in with the Hadoop version we use on EC2 and then fixed it in 8672534... It's possible that it's related, though; trying fd3c6d5 with the correct Hadoop version would be worthwhile.

@fnothaft
Member Author

It's also possible, though, that it's the addition of caching in adam-core/src/main/scala/org/bdgenomics/adam/rdd/variation/ADAMVariationContext.scala in 8672534.
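
Purely as a sketch of why that could matter (speculative, and the names below are illustrative rather than the real code path): with cache() in place, later actions reuse the materialized partitions instead of recomputing the lineage, which could change which executors end up having run the header-priming pass.

```scala
import org.apache.spark.rdd.RDD

object CachingNote {
  def actTwice[T](primed: RDD[T]): Unit = {
    val cached = primed.cache() // pin the partitions after the header-priming pass
    cached.count()              // first action materializes (and caches) them
    cached.count()              // later actions reuse the cached partitions;
                                // without cache(), each action would recompute the
                                // lineage from scratch, possibly on executors that
                                // never ran the mapPartitions header setup
  }
}
```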

@ryan-williams
Member

ah, yea, that also looked potentially relevant.

For completeness, here are a few words about my testing methodology: I have a test-fix script that:

  1. runs mvn clean package -DskipTests in ADAM
  2. injects adam-core and adam-cli JARs into my local maven repository
  3. moves over to guacamole
  4. verifies that the adam.version in guacamole matches the version of ADAM that was built
  5. runs mvn clean package -DskipTests in guacamole, which creates a guacamole JAR that includes the custom ADAM JARs from 2.
  6. runs that guacamole JAR on our cluster on some sample data and checks for the "somatic" column header, whose presence or absence has been necessary and sufficient to determine whether our output VCFs are malformed.

Here are my most recent data points from running ./test-fix at different, relevant ADAM SHAs:

  • a few runs on master (8fa1199): good
  • attempt to git bisect between master and fd3c6d5 (having observed the latter failing previously):
    • 1 run at d663aab: good
    • 1 run at f50af87: good
    • 1 run at 8672534: good
  • at this point bisect has pointed me at 8672534 as the fix, but I suspect there's something more going on, as I'd expected fd3c6d5 to work, and I thought I'd seen it work previously.
  • 1 run at fd3c6d5: good!
  • 7 more runs at fd3c6d5: bad. Maybe fd3c6d5 is bad?
  • 3 runs at 8672534: good
  • 4 runs at fd3c6d5: good, bad, bad, good
  • 4 runs at 8672534: good

@ryan-williams
Member

So I am seeing fd3c6d5 waffling, and 8672534 always passing.

@ryan-williams
Member

Anyway, this is fixed; thanks again @fnothaft. I don't have the bandwidth to figure out what might have been race-y between fd3c6d5 and 8672534, but that's a different problem.

@fnothaft
Member Author

@ryan-williams sorry, this dropped off my radar as well. Glad it is good on your end though.
