Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

genbank and other new WDL workflows #800

Merged
merged 27 commits into from
Mar 29, 2018
Merged

genbank and other new WDL workflows #800

merged 27 commits into from
Mar 29, 2018

Conversation

dpark01
Copy link
Member

@dpark01 dpark01 commented Mar 23, 2018

Adds WDL workflows for:

  • coverage_table: some basic coverage stats on a list of assemblies, similar to Snakemake reports_only
  • mafft: multiple alignment of N genomes, possibly multi-segment (produces one alignment per segment)
  • genbank: produce a Genbank submission file set with their tbl2asn tool given the proper inputs: assemblies, a reference genome and reference annotations to transfer, and required metadata for submission [fixes Create WDL workflow for Genbank submission #793]
  • merge_bams: a simple task for merging bam files and reheadering read group metadata using a flat file mapping table. both the merging and reheadering are optional.

@dpark01 dpark01 requested a review from tomkinsc March 23, 2018 19:55
@tomkinsc
Copy link
Member

Downloading the annotations is left as a separate workflow to execute manually prior to this? Should fetch_annotations.wdl download the fastas as well so the accessions list can be reused for downloading both the *.fasta and *.tbl files? I wonder if passing in the accessions via file input (text file, say) would make more sense than as a dx workflow param—a file would be easier to inspect in the job metadata, and easier to reuse.

@dpark01
Copy link
Member Author

dpark01 commented Mar 26, 2018

Yeah I wanted to leave fetch_annotations disjoint from this process, partly because you'd only really run this once, but potentially the genbank workflow many times during the course of a project, and I also wanted to leave open the option for the user to manually edit the annotations before using them.

Adding fasta fetching to that fetch_annotations task makes a lot of sense and is minimal lift, I can certainly do that.

Non-file inputs (WDL/DNAnexus) such as Array[String] are actually pretty easy to inspect in the dx job metadata.. easier than file contents I think (for example I've also added a scaffolding chosen ref String output to the scaffolding step because that's easier to parse than the fasta output sometimes). Also files might get deleted and you might lose that record.. etc. The number of inputs here in this particular case is really just the number of segments/chromosomes in the genome of this species, which is just one or a couple for most viruses.

@dpark01
Copy link
Member Author

dpark01 commented Mar 26, 2018

Wait, regarding the fetch_annotation task spitting out fastas as well, doesn't this already do that?

@tomkinsc
Copy link
Member

Sorry, didn't catch that fetch_annotation was also pulling fastas. I had assumed the tasks would be grouped at the workflow level, but it makes sense now.

File genbankSourceTable
File? coverage_table # summary.assembly.txt (from Snakemake) -- change this to accept a list of mapped bam files and we can create this table ourselves
String sequencingTech
String comment # TO DO: make this optional
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is blocking this from being optional, or having a default placeholder value?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had a lot of difficulty with the WDL syntax of specifying an optional parameter with a double-quoted string as a value (just because it passes womtool validate doesn't mean dxWDL and Cromwell agree about whether they like the syntax)... gave up and just made it a mandatory field for now since it's almost always specified anyway in a submission. Double quoting is important because people are putting sentences with spaces and punctuation here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could make it an optionl text file

@@ -97,6 +97,31 @@ task plot_coverage {
}


task coverage_report {
Array[File]+ mapped_bams
Array[File] mapped_bam_idx # optional.. speeds it up if you provide it, otherwise we auto-index
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is mapped_bam_idx unused?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is used.. the pysam code when opening the BAM file will look for a similarly named BAI file and fail on the pileup command if such a file does not exist. My wrapper code will auto-index if the index doesn't exist, but it saves time to provide these inputs if it already does exist. I don't think we can get around the hard coded assumptions in pysam and samtools about naming BAM and BAI filenames consistently.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BTW older versions of the WDL assembly pipeline will just emit mapped BAM files for aligning reads to the final assembly. Very recent versions will also emit a similarly named BAI (because why not). So if they do have the indexes available, they'll be named consistently with the BAMs.. if they don't, they can just skip it.

}

output {
Array[File] sequin_files = glob("*.sqn")
File ncbi_package = "${out_prefix}.tar.gz"
File errorSummary = "errorsummary.val"
Array[File] structured_comment_files = glob("*.cmt")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should some of these be specified as one-or-more Array[File]+? (Or does it matter for outputs?)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I played wth that at one point. maybe sequin_files makes sense. The only reason it matters for WDL outputs is if a workflow connects the output of one stage to the input of the next, and the subsequent input has a multiplicity restriction on it (the compiler will enforce that the upstream output has a compatible multiplicity).

# check for index and auto-create if needed
with pysam.AlignmentFile(bam) as af:
is_indexed = af.has_index()
if not is_indexed:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The user should know better, but should an error be emitted if the input bam is not aligned?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that would fail the principle of "allow empty input and pass empty output" though. What if there are genuinely no aligned reads? Currently we'll just populate the output with zeros.

raise Exception("input bam file {} has {} unique samples: {} (require one unique sample)".format(bam, len(samples), str(samples)))
sample_name = samples.pop()
# get and write coverage stats
row = genome_coverage_stats_only(bam, cov_thresholds=cov_thresholds)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How slow is indexing and pulling the coverage stats? Would it make sense to break out these functions to a process pool?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was surprised how quick it was.. seems to be seconds per sample? https://platform.dnanexus.com/projects/FBv3fq00kyFkqP9J34zxb3gq/monitor/analysis/FBv3xx00kyFY7Kpv28JZpv5J shows under a minute for 15 LASV genomes..

@dpark01 dpark01 merged commit f61c71c into master Mar 29, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Create WDL workflow for Genbank submission
3 participants