genbank and other new WDL workflows #800

dpark01 · 2018-03-23T19:55:33Z

Adds WDL workflows for:

coverage_table: some basic coverage stats on a list of assemblies, similar to Snakemake reports_only
mafft: multiple alignment of N genomes, possibly multi-segment (produces one alignment per segment)
genbank: produce a Genbank submission file set with their tbl2asn tool given the proper inputs: assemblies, a reference genome and reference annotations to transfer, and required metadata for submission [fixes Create WDL workflow for Genbank submission #793]
merge_bams: a simple task for merging bam files and reheadering read group metadata using a flat file mapping table. both the merging and reheadering are optional.

…tion of genbank submissions

…ince its fast anyway)

…flow

…enbank

…tructured comment field

…utputs from tbl2asn just for debugging

tomkinsc · 2018-03-26T17:53:28Z

Downloading the annotations is left as a separate workflow to execute manually prior to this? Should fetch_annotations.wdl download the fastas as well so the accessions list can be reused for downloading both the *.fasta and *.tbl files? I wonder if passing in the accessions via file input (text file, say) would make more sense than as a dx workflow param—a file would be easier to inspect in the job metadata, and easier to reuse.

dpark01 · 2018-03-26T18:09:30Z

Yeah I wanted to leave fetch_annotations disjoint from this process, partly because you'd only really run this once, but potentially the genbank workflow many times during the course of a project, and I also wanted to leave open the option for the user to manually edit the annotations before using them.

Adding fasta fetching to that fetch_annotations task makes a lot of sense and is minimal lift, I can certainly do that.

Non-file inputs (WDL/DNAnexus) such as Array[String] are actually pretty easy to inspect in the dx job metadata.. easier than file contents I think (for example I've also added a scaffolding chosen ref String output to the scaffolding step because that's easier to parse than the fasta output sometimes). Also files might get deleted and you might lose that record.. etc. The number of inputs here in this particular case is really just the number of segments/chromosomes in the genome of this species, which is just one or a couple for most viruses.

dpark01 · 2018-03-26T18:11:47Z

Wait, regarding the fetch_annotation task spitting out fastas as well, doesn't this already do that?

tomkinsc · 2018-03-26T18:42:42Z

Sorry, didn't catch that fetch_annotation was also pulling fastas. I had assumed the tasks would be grouped at the workflow level, but it makes sense now.

tomkinsc · 2018-03-29T13:22:07Z

pipes/WDL/workflows/tasks/ncbi.wdl

+ File genbankSourceTable
+ File? coverage_table # summary.assembly.txt (from Snakemake) -- change this to accept a list of mapped bam files and we can create this table ourselves
+ String sequencingTech
+ String comment # TO DO: make this optional


What is blocking this from being optional, or having a default placeholder value?

I had a lot of difficulty with the WDL syntax of specifying an optional parameter with a double-quoted string as a value (just because it passes womtool validate doesn't mean dxWDL and Cromwell agree about whether they like the syntax)... gave up and just made it a mandatory field for now since it's almost always specified anyway in a submission. Double quoting is important because people are putting sentences with spaces and punctuation here.

Could make it an optionl text file

tomkinsc · 2018-03-29T13:26:00Z

pipes/WDL/workflows/tasks/reports.wdl

@@ -97,6 +97,31 @@ task plot_coverage {
 }


+task coverage_report {
+ Array[File]+ mapped_bams
+ Array[File] mapped_bam_idx # optional.. speeds it up if you provide it, otherwise we auto-index


Is mapped_bam_idx unused?

It is used.. the pysam code when opening the BAM file will look for a similarly named BAI file and fail on the pileup command if such a file does not exist. My wrapper code will auto-index if the index doesn't exist, but it saves time to provide these inputs if it already does exist. I don't think we can get around the hard coded assumptions in pysam and samtools about naming BAM and BAI filenames consistently.

BTW older versions of the WDL assembly pipeline will just emit mapped BAM files for aligning reads to the final assembly. Very recent versions will also emit a similarly named BAI (because why not). So if they do have the indexes available, they'll be named consistently with the BAMs.. if they don't, they can just skip it.

tomkinsc · 2018-03-29T13:32:00Z

pipes/WDL/workflows/tasks/ncbi.wdl

 }

 output {
 Array[File] sequin_files = glob("*.sqn")
- File ncbi_package = "${out_prefix}.tar.gz"
- File errorSummary = "errorsummary.val"
+ Array[File] structured_comment_files = glob("*.cmt")


Should some of these be specified as one-or-more Array[File]+? (Or does it matter for outputs?)

I played wth that at one point. maybe sequin_files makes sense. The only reason it matters for WDL outputs is if a workflow connects the output of one stage to the input of the next, and the subsequent input has a multiplicity restriction on it (the compiler will enforce that the upstream output has a compatible multiplicity).

tomkinsc · 2018-03-29T13:40:53Z

reports.py

+ # check for index and auto-create if needed
+ with pysam.AlignmentFile(bam) as af:
+ is_indexed = af.has_index()
+ if not is_indexed:


The user should know better, but should an error be emitted if the input bam is not aligned?

I think that would fail the principle of "allow empty input and pass empty output" though. What if there are genuinely no aligned reads? Currently we'll just populate the output with zeros.

tomkinsc · 2018-03-29T13:42:51Z

reports.py

+ raise Exception("input bam file {} has {} unique samples: {} (require one unique sample)".format(bam, len(samples), str(samples)))
+ sample_name = samples.pop()
+ # get and write coverage stats
+ row = genome_coverage_stats_only(bam, cov_thresholds=cov_thresholds)


How slow is indexing and pulling the coverage stats? Would it make sense to break out these functions to a process pool?

I was surprised how quick it was.. seems to be seconds per sample? https://platform.dnanexus.com/projects/FBv3fq00kyFkqP9J34zxb3gq/monitor/analysis/FBv3xx00kyFY7Kpv28JZpv5J shows under a minute for 15 LASV genomes..

dpark01 added 19 commits March 9, 2018 21:59

allow more than one way

08158cb

remove the "v." from the version number in the structured comment sec…

1eaa09b

…tion of genbank submissions

start restructure of align_and_annot workflow to be unparallelized (s…

d66d255

…ince its fast anyway)

finalize genbank submission workflow

a889ddd

add coverage_report task and workflow, update genbank submission work…

ff43071

…flow

add merge_bams wdl

9fbefae

Merge remote-tracking branch 'origin/master' into dp-genbank

53e2ba4

rename to avoid name space clash

c01526b

dxWDL doesnt like some of the syntax, so simplify it a bit

71fd973

update annot_transfer output multiplicity to match input of prepare_g…

63d7039

…enbank

trim down some multiplicity requirements

3f7b1dc

comment out broken wdl task

7876e65

attempt to reduce usage of the wdl standard library tog et this to run

39ad3b4

try another way

c40f5e6

more attempts to fix syntax

e78bd37

oops

737c830

remove some vestigal wdl code

c40f361

remove $

56a2aef

fix Integer -> Int

caa4e4a

dpark01 requested a review from tomkinsc March 23, 2018 19:55

Merge remote-tracking branch 'origin/master' into dp-genbank

b57bdf7

dpark01 and others added 6 commits March 23, 2018 16:20

forgot to sep the array

740151a

mol_type instead of molType

b02b50e

tweaks to mafft

0514da4

revert previous change and put the "v." back in the Assembly Method s…

07d77ba

…tructured comment field

remove --verbose from mafft (its a little too verbose) and add .fsa o…

083665b

…utputs from tbl2asn just for debugging

Merge branch 'master' into dp-genbank

b8855d8

Merge branch 'master' into dp-genbank

b660097

tomkinsc reviewed Mar 29, 2018

View reviewed changes

tomkinsc approved these changes Mar 29, 2018

View reviewed changes

dpark01 merged commit f61c71c into master Mar 29, 2018

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

genbank and other new WDL workflows #800

genbank and other new WDL workflows #800

dpark01 commented Mar 23, 2018

tomkinsc commented Mar 26, 2018

dpark01 commented Mar 26, 2018

dpark01 commented Mar 26, 2018

tomkinsc commented Mar 26, 2018

tomkinsc Mar 29, 2018

dpark01 Mar 29, 2018

notestaff Mar 29, 2018

tomkinsc Mar 29, 2018

dpark01 Mar 29, 2018

dpark01 Mar 29, 2018

tomkinsc Mar 29, 2018

dpark01 Mar 29, 2018

tomkinsc Mar 29, 2018

dpark01 Mar 29, 2018

tomkinsc Mar 29, 2018

dpark01 Mar 29, 2018

genbank and other new WDL workflows #800

genbank and other new WDL workflows #800

Conversation

dpark01 commented Mar 23, 2018

tomkinsc commented Mar 26, 2018

dpark01 commented Mar 26, 2018

dpark01 commented Mar 26, 2018

tomkinsc commented Mar 26, 2018

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment