-
Notifications
You must be signed in to change notification settings - Fork 597
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
PathSeq WDL overhaul #6536
PathSeq WDL overhaul #6536
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some questions. The big thing is that it would be great to update this to WDL 1.0.
scripts/pathseq/wdl/README.md
Outdated
- ``PathSeqPipelineWorkflow.min_clipped_read_length`` -- Minimum read length after quality trimming. You may need to reduce this if your input reads are shorter than the default value (default 60) | ||
- ``PathSeqPipelineWorkflow.estimate_filter_metrics_with_downsampling`` -- read filter metrics will be estimated using a downsampled bam (highly recommended) (default true) | ||
- ``PathSeqPipelineWorkflow.estimate_filter_metrics_reads`` -- number of reads to downsample to for filter metrics estimation, recommended 1M for samples with ~0.1% non-host reads (default 1M) | ||
- ``PathSeqPipelineWorkflow.min_clipped_read_length`` -- Minimum read length after quality trimming. Increasing may increase microbial classification specificity but may reduce sensitivity (default 31) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What made you change this from 60 to 31?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some libraries have very short reads (eg older TCGA data with ~40bp). This makes it less likely for users to get confused when the output comes up empty.
"PathSeqPipelineWorkflow.sample_name": "sample", | ||
"PathSeqPipelineWorkflow.input_bam": "gs://my-bucket/sample.bam", | ||
"PathSeqThreeStageWorkflow.sample_name": "sample", | ||
"PathSeqThreeStageWorkflow.input_bam_or_cram": "gs://my-bucket/sample.bam", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you talk to @bshifaw about what needs to happen to "feature" the PathSeq workspace? It's likely the only this is to put real (mini) data here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thats right, mini data would do fine for now. Here is an old json with mini data for pathseq here. Its NA12878_24RG contaminated with chicken reads.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@bshifaw Is it SOP to have these json files? I was considering deleting it - it seems like this kind of metadata should be made available on Terra instead, ie through featured workspaces.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's also problematic that I can't provide a docker that would support this workflow until we cut a new release
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, having the JSON along with the WDL is SOP. Correct, the JSON would be available in Terra (which requires a google account) but not everyone will be looking at the workflow via Terra. You may have visitors directly from Dockstore or to this repo looking for an example JSON.
I can see why not having a docker would be a problem, maybe a place older can be added for the next release or use ":latest". If both don't seem inappropriate, maybe hold off on the JSON for now but eventually add an example JSON with the WDL.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@bshifaw Okay thanks for clarifying, I will update the JSON then. Can we add an index to the chicken sample bam and put it at gs://gatk-best-practices/pathseq/contaminated-bam/NA12878_24RG_med.hg380.7chicken0.3.bam.bai
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, done
@@ -108,7 +108,7 @@ | |||
* counts files, and all contigs appearing in the input counts files must have a corresponding entry in the priors | |||
* table. The order of contigs is immaterial in the priors table. The highest ploidy state is determined by the | |||
* prior table (3 in the above example). A ploidy state can be strictly forbidden by setting its prior probability | |||
* to 0. For example, the X contig in the above example can only assume 0 and 1 ploidy states.</p> | |||
* to 0. For example, the Y contig in the above example can only assume 0 and 1 ploidy states.</p> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this need a rebase? I thought I just merged a PR with this change.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm something strange happened here. I rebased but it's still showing up... maybe one of my commits changes it to X then another back to Y. The final result is correct though.
src/main/java/org/broadinstitute/hellbender/tools/spark/pathseq/PathSeqBwaSpark.java
Outdated
Show resolved
Hide resolved
* host k-mer file, and taxonomy file may also be copied to a single path on every worker node or to HDFS.</p> | ||
* | ||
* <h3>References</h3> | ||
* <ol> | ||
* <li>Walker, M. A., Pedamallu, C. S. et al. (2018). GATK PathSeq: a customizable computational tool for the discovery and identification of microbial sequences in libraries from eukaryotic hosts. Bioinformatics. 34, 4287-4289.</li> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yay!
command <<< | ||
set -euo pipefail | ||
if ${defined(input_bam_index_file)}; then | ||
NUM_READS=`samtools idxstats ${input_bam_file} | awk '{s+=$3+$4} END {print s}'` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Although you can't play this trick with a cram. Pros and cons.
echo $NUM_READS > num_reads.txt | ||
java -Xmx2000m -jar /usr/gitc/picard.jar \ | ||
DownsampleSam \ | ||
INPUT=${input_bam_file} \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To call this from GATK I think the args will go to -I
, -O
, --VALIDATON_STRINGENCY
and maybe --P
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed this to use GATK instead of Picard
input: | ||
input_bam_file=bam_file, | ||
input_bam_index_file=bam_index, | ||
downsampled_bam_filename="${sample_name}.downsampled.bam", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
While you're here, it would be great to rev this to WDL 1.0.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
${if defined(filter_reads_per_partition) then "--filter-reads-per-partition ${filter_reads_per_partition}" else ""} | ||
|
||
if [ ! -f "${paired_bam_output_path}" ]; then | ||
echo "File ${paired_bam_output_path} not found, creating empty BAM" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do you prefer to create an empty BAM rather than have a default output path?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If I remember correctly, the BAM will not be written if there are no reads. I believe this is a behavior of Spark tools in general and not something I can change in PathSeq.
Float frac_final_total = read_float("frac_final_total.txt") | ||
Float frac_qual_cpx_filtered = read_float("frac_qual_cpx_filtered.txt") | ||
Float frac_host_filtered = read_float("frac_host_filtered.txt") | ||
Float frac_dup_filtered = read_float("frac_dup_filtered.txt") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These floats are nice for Terra data model output. Thanks.
"PathSeqPipeline.PathSeqFilter.filter_bwa_image": "gs://gatk-best-practices/pathseq/resources/pathseq_host.fa.img", | ||
"PathSeqPipeline.PathSeqScore.taxonomy_file": "gs://gatk-best-practices/pathseq/resources/meats.min2k.db", | ||
|
||
"PathSeqPipeline.gatk_docker": "broadinstitute/gatk:latest" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you pin this version please? The provenance is easier to figure out if the docker is fixed in the input json.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
scripts/pathseq/wdl/README.md
Outdated
|
||
PathSeq reference files are available in the [GATK Resource Bundle](https://software.broadinstitute.org/gatk/download/bundle) (located [here](ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/beta/PathSeq)). | ||
PathSeq reference files are available in the [GATK Resource Bundle](https://software.broadinstitute.org/gatk/download/bundle) at `gs://gatk-best-practices/pathseq`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That broad link is giving me a 404. Maybe just take it out?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like they moved it to https://gatk.broadinstitute.org/hc/en-us/articles/360035890811-Resource-bundle
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Two minor fixes, otherwise good to go!
* New pathseq wdl with downsampling; remove unneeded microbe reference fasta in alignment step Co-authored-by: ldgauthier <gauthier@broadinstitute.org>
This new PathSeq WDL redesigns the workflow for improved performance in the cloud. Downsampling can be applied to BAMs with high microbial content (ie >10M reads) that normally cause performance issues.
Other improvements include: