genbank and other new WDL workflows #800
…tion of genbank submissions
…ince its fast anyway)
…tructured comment field
…utputs from tbl2asn just for debugging
Downloading the annotations is left as a separate workflow to execute manually prior to this? Should |
Yeah I wanted to leave Adding fasta fetching to that Non-file inputs (WDL/DNAnexus) such as |
Wait, regarding the |
Sorry, didn't catch that |
File genbankSourceTable | ||
File? coverage_table # summary.assembly.txt (from Snakemake) -- change this to accept a list of mapped bam files and we can create this table ourselves | ||
String sequencingTech | ||
String comment # TO DO: make this optional |
What is blocking this from being optional, or having a default placeholder value?
I had a lot of difficulty with the WDL syntax of specifying an optional parameter with a double-quoted string as a value (just because it passes womtool validate doesn't mean dxWDL and Cromwell agree about whether they like the syntax)... gave up and just made it a mandatory field for now since it's almost always specified anyway in a submission. Double quoting is important because people are putting sentences with spaces and punctuation here.
Could make it an optional text file.
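A rough sketch of how the optional-text-file idea could look on the Python side, so that sentences with spaces and punctuation never pass through WDL string quoting. The function name `load_comment` and its parameters are hypothetical and not part of this PR:

```python
from typing import Optional


def load_comment(comment_file: Optional[str], default: str = "") -> str:
    """Return the submission comment, or `default` when no file is given.

    Hypothetical helper: the comment lives in a plain text file, so the
    workflow only ever passes an optional file path around.
    """
    if comment_file is None:
        return default
    with open(comment_file, encoding="utf-8") as fh:
        # Collapse newlines and runs of whitespace into single spaces.
        return " ".join(fh.read().split())
```

With this shape, leaving the input unset simply yields the default (an empty string here).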
@@ -97,6 +97,31 @@ task plot_coverage { | |||
} | |||
|
|||
|
|||
task coverage_report { | |||
Array[File]+ mapped_bams | |||
Array[File] mapped_bam_idx # optional.. speeds it up if you provide it, otherwise we auto-index |
Is mapped_bam_idx unused?
It is used. When the pysam code opens the BAM file, it looks for a similarly named BAI file and fails on the pileup command if no such file exists. My wrapper code will auto-index if the index doesn't exist, but it saves time to provide these inputs if they already exist. I don't think we can get around the hard-coded assumptions in pysam and samtools about naming BAM and BAI files consistently.
BTW, older versions of the WDL assembly pipeline just emit mapped BAM files from aligning reads to the final assembly. Very recent versions also emit a similarly named BAI (because why not). So if users have the indexes available, they'll be named consistently with the BAMs; if they don't, they can just skip this input.
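The naming assumption discussed here can be sketched in plain Python. This is only an illustration, not the wrapper code from this PR: `find_bam_index` and `ensure_bam_index` are hypothetical names, the two candidate index names (`x.bam.bai` and `x.bai`) are the conventional ones, and the indexer is passed in as a callable (something like `pysam.index` could be supplied, assuming it writes `<bam>.bai`):

```python
import os
from typing import Callable, Optional


def find_bam_index(bam_path: str) -> Optional[str]:
    """Return the path of an existing index named after the BAM, else None."""
    candidates = [
        bam_path + ".bai",                       # sample.bam.bai
        os.path.splitext(bam_path)[0] + ".bai",  # sample.bai
    ]
    for candidate in candidates:
        if os.path.isfile(candidate):
            return candidate
    return None


def ensure_bam_index(bam_path: str, indexer: Callable[[str], object]) -> str:
    """Reuse a consistently named index if present, otherwise build one.

    `indexer` is an assumption of this sketch: any callable that creates
    `<bam_path>.bai` when given the BAM path.
    """
    existing = find_bam_index(bam_path)
    if existing is not None:
        return existing
    indexer(bam_path)
    return bam_path + ".bai"
```

Providing the index up front only skips the `indexer` call; the result is the same either way.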
} | ||
|
||
output { | ||
Array[File] sequin_files = glob("*.sqn") | ||
File ncbi_package = "${out_prefix}.tar.gz" | ||
File errorSummary = "errorsummary.val" | ||
Array[File] structured_comment_files = glob("*.cmt") |
Should some of these be specified as one-or-more Array[File]+? (Or does it matter for outputs?)
I played with that at one point. Maybe it makes sense for sequin_files. The only reason it matters for WDL outputs is if a workflow connects the output of one stage to the input of the next, and the subsequent input has a multiplicity restriction on it (the compiler will enforce that the upstream output has a compatible multiplicity).
# check for index and auto-create if needed | ||
with pysam.AlignmentFile(bam) as af: | ||
is_indexed = af.has_index() | ||
if not is_indexed: |
The user should know better, but should an error be emitted if the input bam is not aligned?
I think that would fail the principle of "allow empty input and pass empty output" though. What if there are genuinely no aligned reads? Currently we'll just populate the output with zeros.
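To make the "empty input, empty output" behavior concrete, here is a minimal sketch of a per-sample summary that reports zeros when there are no aligned reads. It is not the `genome_coverage_stats_only` implementation from this PR; the function name `coverage_summary`, the per-position `depths` input, the default thresholds, and the output keys are all assumptions for illustration:

```python
from typing import Dict, Sequence


def coverage_summary(
    depths: Sequence[int],
    cov_thresholds: Sequence[int] = (1, 5, 20, 100),
) -> Dict[str, float]:
    """Summarize per-position read depths; an empty input yields all zeros."""
    # Guard the mean so an empty depth list gives 0.0 instead of raising.
    row: Dict[str, float] = {
        "mean_depth": (sum(depths) / len(depths)) if depths else 0.0
    }
    for threshold in cov_thresholds:
        # Count positions covered at or above each threshold.
        row["bases_ge_{}x".format(threshold)] = sum(
            1 for d in depths if d >= threshold
        )
    return row
```

A BAM with no aligned reads then produces a normal row of zeros rather than an exception.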
raise Exception("input bam file {} has {} unique samples: {} (require one unique sample)".format(bam, len(samples), str(samples))) | ||
sample_name = samples.pop() | ||
# get and write coverage stats | ||
row = genome_coverage_stats_only(bam, cov_thresholds=cov_thresholds) |
How slow is indexing and pulling the coverage stats? Would it make sense to break out these functions to a process pool?
I was surprised how quick it was. It seems to take seconds per sample: https://platform.dnanexus.com/projects/FBv3fq00kyFkqP9J34zxb3gq/monitor/analysis/FBv3xx00kyFY7Kpv28JZpv5J shows under a minute for 15 LASV genomes.
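If the per-sample step ever did become slow, the process-pool idea could be sketched roughly as follows. This is a hypothetical wrapper, not code from this PR: `coverage_rows` and its parameters are made-up names, and `stats_fn` stands in for whatever per-BAM function computes a row (it would need to be a picklable, module-level function to run in worker processes):

```python
from concurrent.futures import ProcessPoolExecutor
from typing import Callable, Iterable, List, TypeVar

Row = TypeVar("Row")


def coverage_rows(
    bams: Iterable[str],
    stats_fn: Callable[[str], Row],
    max_workers: int = 1,
) -> List[Row]:
    """Compute one row per BAM, optionally fanning out to worker processes."""
    bam_list = list(bams)
    # Stay serial by default: with seconds per sample, pool startup and
    # pickling overhead may cost more than they save.
    if max_workers <= 1 or len(bam_list) <= 1:
        return [stats_fn(bam) for bam in bam_list]
    with ProcessPoolExecutor(max_workers=max_workers) as pool:
        # map() returns results in input order, so rows line up with BAMs.
        return list(pool.map(stats_fn, bam_list))
```

Given the timings reported above, the serial default is probably all that is needed.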
Adds WDL workflows for: