Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add WDL for iSNVs/v-phaser #828

Merged
merged 61 commits into from
Jul 6, 2018
Merged

Add WDL for iSNVs/v-phaser #828

merged 61 commits into from
Jul 6, 2018

Conversation

tomkinsc
Copy link
Member

This PR adds WDLs for running v-phaser on one sample to output iSNVs. It probably makes sense to wait to do multi-sample via scatter-gather until we update everything to WDL 1.0 which has a Pair datatype for gathering multiple files (we're using WDL draft-2 currently).
Also, this should probably be merged via rebase due to all of the dx debugging...

…aser output

merge_to_vcf did not previously output correct alternate genotype number for some cases of multi-chr v-phaser output (line intrahost.py:820 of this commit). This corrects that and changes test cases to use the chr "-#" encoding to denote chromosome/segment number. This also changes merge_to_vcf so sample names are guessed from first column of v-phaser output if sample names are not passed in via --samples.
add dummy workflows to support newer cromwell that requires WDL files to have a workflow block; add single-sample isnv workflows
restore dummy workflows to WDL task files; this allows womtool to validate them
Note: this is WDL draft-2. Changing to WDL 1.0 will require various changes, starting with a "version 1.0" statement as the first line in each WDL. Also, input {} blocks. See: https://github.com/openwdl/wdl/blob/master/versions/1.0/SPEC.md
add dx default file inputs for new isnv workflows (mostly taxon_filter DBs)
@dpark01
Copy link
Member

dpark01 commented Jun 28, 2018

Okay this is a big one... I think I'll need some deconstructed explanation on each piece. So far, I see a few things in this PR:

  1. A ton of diffs that have to do with updating to the latest dxWDL + Cromwell and all the various changes that imposed.
  2. A new WDL workflow for single sample iSNVs.
  3. Some decent sized changes to intrahost.merge_to_vcf

A few questions and thoughts:

  • How interconnected are those three changes? Are they somewhat orthogonal/independent or did 2 necessitate both 1 and 3?
  • I've been watching some of the dxWDL updates from a distance and haven't yet given it a spin myself, but I was concerned about how they dropped support for implicit workflow inputs (percolated up from unlinked task inputs). But it looked like the various new releases of dxWDL have gone back and forth on supporting unlocked workflows.. not sure how related that was. I dunno, if the Cromwell/OpenWDL community more broadly has generally moved away from implicit workflow inputs then I guess we can follow, but I'm not terribly happy with that direction, but if this was only dxWDL we can inquire about it.
  • What motivated change 3? That particular method is a pretty hairy beast to dive into.. lint has complained about how overly complex it is but I've struggled to find ways to simplify and break it down because it really is managing a ton of edge cases at once (and I'm not absolutely sure that all those cases are well tested).

Anyway, breaking it down a bit will help navigate the actual diffs.

@tomkinsc
Copy link
Member Author

tomkinsc commented Jul 2, 2018

  • The three changes are fairly orthogonal, however w/r/t changes 1&2, in the course of writing the new WDL workflows I wanted to target the newer dxWDL + Cromwell (at least under the draft-2 spec). They're interdependent as included here.
  • The trend seems to be toward requiring inputs to be explicitly specified. With the combination of versions of dxWDL and Cromwell used in this PR with the draft-2 format WDL files, explicit declaration appears to extend only to required inputs. With the move to WDL 1.0 we may need to also explicitly list the optional inputs from tasks in workflows (not sure yet). It will become apparent when we start updating from WDL draft-2 to 1.0 format. I haven't come across a migration guide, and it is a bit difficult to tell when comparing the specs. In general, we will need to make fairly significant changes to support WDL 1.0...
  • Change 3 is two things:
    • a block that infers sample names from the v-phaser output files in the case where a list of sample names is not passed in. This is to reduce the number of inputs required by intrahost.merge_to_vcf(), with the WDL/DNAnexus implementation in mind.
    • a bugfix to encode+report ALT alleles in the VCF GT fields (not sure if anyone was bitten by the bug?). This should have been caught by our tests, except the sample names of the tests were not changed to the sampleName-chr# format when we changed the rest of the code to use the-chr# suffix format for dealing with multiple chromosomes.

Copy link
Member

@dpark01 dpark01 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, went through the diffs in more detail, let me know what you think

intrahost.py Outdated
"""samples, isnvs, and alignments must have the same number of elements
(plus an extra reference record in the alignment).
%s does not have the right number of sequences""" % fileName)
# for fileName in alignments:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This commented out section of intrahost.merge_to_vcf appears to have been simply a pre-screening sanity check / safety check to make sure the inputs were self consistent. Is this check no longer necessary or is it just that it doesn't function for guessed sample names?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll adapt it for the both cases, where sample names are provided, and when they are not.

intrahost.py Outdated
@@ -580,7 +595,10 @@ def merge_to_vcf(
alignmentFile)

for s in samplesToUse:
for row in util.file.read_tabfile(samp_to_isnv[sampleIDMatch(s)]):
isnv_filepath = samp_to_isnv[sampleIDMatch(s)]
if os.path.getsize(isnv_filepath) == 0:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why the explicit check for os.path.getsize() == 0? Are there special scenarios where we pass in empty files that semantically mean something we should skip, or is this just a normal situation where vphaser detects no isnvs for that particular file (and thus produces empty output)? util.file.read_tabfile should simply produce an empty iterator, so this check seems unnecessary?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right, this should not be necessary.

intrahost.py Outdated
isnvs_to_use.append(isnvs_file)
continue
for row in util.file.read_tabfile(isnvs_file):
guessed_sample_ID = sampleIDMatch(row[0])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I read this correctly, this code is unreachable, because anytime the isnvs_file is non-empty, the above if statement will continue and skip this for loop, but any time it's empty, the for loop won't do anything.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, that was sloppy of me. I'll fix this and add a unit test where sample names are guessed so this has some test coverage.

@@ -800,7 +817,7 @@ def merge_to_vcf(
continue
alleleMap = dict((a, i) for i, a in enumerate(alleles))
# GT col emitted below
genos = [str(alleleMap.get(consAlleles.get(s), '.')) for s in samples]
genos = [str(alleleMap.get(consAlleles.get(s), '.')) for s in samplesToUse]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oops, so this was a bug before?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. :(

@@ -152,7 +152,7 @@ def test_single_strand_bias_hard_filter(self):
self.assertEqual(output[0][7:], expected[7:])


@unittest.skipIf(tools.is_osx(), "vphaser2 osx binary from bioconda has issues")
#@unittest.skipIf(tools.is_osx(), "vphaser2 osx binary from bioconda has issues")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

did the osx version get fixed?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems to be? The test passes on OSX and I can get help text when I call it in isolation. Do you recall what the issue was?

@@ -224,8 +224,10 @@ def add_indel(self, sample, chrom, pos, acounts, libinfo=None):

def dump_isnv_tmp_file(self, sample):
fn = util.file.mkstempfname('.txt')
print("self.isnvs[sample]",sample,list(map(str,self.isnvs[sample])))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

are we sure we want to print all this or was this leftover from a debugging stage?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

leftover, whoops!

tomkinsc and others added 3 commits July 3, 2018 17:27
add unit test for case where sample names are guessed for merge_to_vcf; rework the sectio where they are guessed
Copy link
Member

@dpark01 dpark01 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, let's make it so!

cached_fetch_jar_from_github dnanexus dxWDL dxWDL 0.60.2
cached_fetch_jar_from_github broadinstitute cromwell womtool 32
cached_fetch_jar_from_github broadinstitute cromwell cromwell 32
cached_fetch_jar_from_github dnanexus dxWDL dxWDL 0.66.2
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If it doesn't cause much trouble, since you're updating from dxWDL 0.60.2 -> 0.66.2 anyway, want to just try going all the way to 0.68.1 (since some time has elapsed in the course of the development of this branch)?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's see what happens. I'll try updating Cromwell too.

@dpark01
Copy link
Member

dpark01 commented Jul 5, 2018

Actually it might be weird if you update Cromwell because then dxWDL and Cromwell will be out of sync (dxWDL 0.68.1 is based on Cromwell 32, I think).

@tomkinsc tomkinsc merged commit 567968f into master Jul 6, 2018
@tomkinsc tomkinsc deleted the ct-vphaser-wdl branch July 6, 2018 02:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants