Output non host reads and non host contigs for AMR #205

valenzuelaomar · 2023-04-03T22:30:52Z

Outputs the non host reads and non host contigs directly from the AMR workflow. Once this change is merged, the non host reads and non host contigs files will be output twice: once as individual files and once inside the outputs.zip file.

I'll ask comp bio what we should do about this. Ideally we don't want to output the same file twice.
UPDATE: our cross-functional partners said it's OK to remove the non host contigs and reads from outputs.zip.

Their suggestion was to:

- Keep all the individual downloads.
- Change “Download All with Intermediate Files (.zip)” to “Download Intermediate Files (.zip)” and only include the intermediate files in the download.
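The split suggested above (individual downloads stay separate, the zip holds only intermediate files) can be sketched in Python. This is a minimal illustration, not the workflow's implementation: the function name `zip_intermediate_files` and the file names in the usage note are hypothetical, and it assumes outputs can be matched to individual downloads by file name.

```python
import zipfile
from pathlib import Path
from typing import Iterable, List


def zip_intermediate_files(
    all_outputs: Iterable[str],
    individual_downloads: Iterable[str],
    zip_path: str,
) -> List[str]:
    """Zip every output that is not already offered as an individual download.

    Returns the archive member names, in the order they were added.
    """
    # Assumption: a file counts as "already downloadable" if its base name matches.
    skip = {Path(p).name for p in individual_downloads}
    included: List[str] = []
    with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        for path in all_outputs:
            name = Path(path).name
            if name in skip:
                continue  # served on its own, so keep it out of the zip
            archive.write(path, arcname=name)
            included.append(name)
    return included
```

For example, calling it with `contigs.fasta` and `non_host_reads.fasta` listed as individual downloads would leave both out of the archive, so no file is shipped twice.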

rzlim08

Thanks!

valenzuelaomar · 2023-04-12T17:23:38Z

@rzlim08 how do you concatenate the paired end non host reads files into just one fasta file?

rzlim08 · 2023-04-12T19:37:55Z

> @rzlim08 how do you concatenate the paired end non host reads files into just one fasta file?

Ah, I didn't realize we were doing this for this PR. I have a relatively simple plan and will update this afternoon.

rzlim08 · 2023-04-12T21:30:07Z

workflows/amr/run.wdl

-        cp ~{sep=' ' outputFiles} .
+        # copy contigs and interleave non_host_reads
+        cp ~{contigs_in} contigs.fasta
+        seqfu ilv -1 ~{sep=" -2 " nonHostReads} | gzip > non_host_reads.fasta.gz


Do I want to gzip this file?

I think the unzipped version is what the frontend expects, but maybe it's a good question for product/compbio. I think the gzipped version would be a good space-saving solution.
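The diff above interleaves the paired-end files with `seqfu ilv` and pipes the result through `gzip`. As a rough illustration of what interleaving means here, the following Python sketch alternates records from the two files and writes gzipped output. It is not the workflow's code. It assumes uncompressed inputs with a fixed number of lines per record (4 for FASTQ, 2 for single-line FASTA), and the names `read_records` and `interleave` are hypothetical.

```python
import gzip
from itertools import islice
from typing import Iterator, List, TextIO


def read_records(handle: TextIO, lines_per_record: int = 4) -> Iterator[List[str]]:
    """Yield records as lists of newline-terminated lines."""
    while True:
        record = [line.rstrip("\n") + "\n" for line in islice(handle, lines_per_record)]
        if not record:
            return
        if len(record) != lines_per_record:
            raise ValueError("truncated record at end of file")
        yield record


def interleave(r1_path: str, r2_path: str, out_path: str, lines_per_record: int = 4) -> int:
    """Write R1 and R2 records alternately to a gzipped file; return the pair count."""
    pairs = 0
    with open(r1_path) as r1, open(r2_path) as r2, gzip.open(out_path, "wt") as out:
        reads1 = read_records(r1, lines_per_record)
        reads2 = read_records(r2, lines_per_record)
        for rec1 in reads1:
            rec2 = next(reads2, None)
            if rec2 is None:
                raise ValueError("R2 has fewer records than R1")
            out.writelines(rec1)  # forward read first
            out.writelines(rec2)  # then its mate
            pairs += 1
        if next(reads2, None) is not None:
            raise ValueError("R2 has more records than R1")
    return pairs
```

Dropping the `gzip.open` call in favor of a plain `open` would give the uncompressed variant discussed in this thread; which one the frontend should receive is the open question here.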

rzlim08 · 2023-04-12T22:24:59Z

@ovalenzuela19 This should be ready now

This reverts commit 9de3fc2.

…219)
* Revert "output gene id in primary output file (#209)" This reverts commit 2d9ff56.
* Revert "Output non host reads and non host contigs for AMR (#205)" This reverts commit 9de3fc2.

* fastp
* fastp single
* bowtie2 run
* hisat2 run
* dedup run
* run subsample
* run kallisto
* adjust index tar filenames
* polishing
* polishing
* count reads in each step
* Create host_filter_indexing.wdl
* boost fastp complexity threshold
* output fastp report
* build fastp from our fork with SDUST complexity filtering
* use fastp --sdust_complexity_filter
* bump
* bump
* tune
* stub the remaining step descriptions
* wire to tests
* and auto_benchmark
* fixup tests
* fixup tests
* fixup tests
* fixup tests
* fixup tests
* fixup tests
* add back in picard CollectInsertSizeMetrics
* picard step description
* host_filter_2022.wdl => host_filter.wdl
* polish
* restore fastqs_0 and fastqs_1 to minimize collateral changes
* add minimap2 index build
* picard_insert_metrics.txt
* amr/run.wdl workaround
* index multiple transcripts_fasta_gz
* make gtf optional
* allow uncompressed genome fasta
* allow uncompressed genome fasta
* allow uncompressed genome fasta
* bump minimap2 memory
* bump minimap2 memory
* step descriptions -- first draft
* add indexing driver & draft readme
* include invocations in step descriptions
* rebase amr fix
* load card_json
* run kallisto every time
* fix amr wdl
* fix short-read-mngs rebase weirdness
* add final things
* [modernized host filter] add ERCC and gene-level outputs to kallisto (#175)

  The kallisto step gains two new derivative output files:

  * `ERCC_counts.tsv`: Estimated read counts for the ERCC sequences only (two-column TSV: ERCC_id, est_counts)
  * `gene_abundance.tsv`: gene-level est_counts and tpm, computed by summing over all transcripts for each gene
  * (and `abundance.tsv` is renamed to `transcript_abundance.tsv`)

  To get the `gene_abundance.tsv` we need a new input `gtf_gz`, the Ensembl GTF file for the host species that will tell it how to map the transcript IDs in `transcript_abundance.tsv` onto gene IDs for the roll-up. The input is optional and if absent then the `gene_abundance.tsv` output is omitted too.

  Note: docker image update needed to install & upgrade some dependencies.
* load card_json explicitly
* add ~
* fix host_filter unit tests
* fix host_filter unit tests
* bowtie2: sort by read name for better reproducibility
* update minimap2 indexing invocation
* add chelonia_mydas, drosophila_melanogaster, gray_whale, pea-aphid
* copy-paste {bowtie2,hisat2}_human_filter to support pipeline viz
* allow kallisto nonzero exit
* rename modern host filtering inputs/outputs and create a 1-1 mapping between inputs/outputs
* fix lint issue
* rename reads_in_count to input_read_count
* auto_benchmark updates
* fix test_RunCZIDDedup_safe_csv
* rename kallisto output files
* update mosquitos with several Culicidae
* add files to wdl output for pipeline viz compatibility
* convert headers in descriptions to bolded text
* delete host_filter_indexing since it's subsumed in #182
* fix glob patterns in read counting
* Revert "fix glob patterns in read counting" This reverts commit aeb234f.
* [Bug] fix count expansion for single file short-read-mngs (#216)
* fix bowtie2 counts for single file
* fix extra expansions
* relieve hisat2 dependency
* single sample hisat2
* fix hisat2
* fix dockerfile for hisat2

---------

Co-authored-by: Omar Valenzuela <51972068+ovalenzuela19@users.noreply.github.com>

* Remove AMR changes that are a WIP from modern host filtering branch (#219)
* Revert "output gene id in primary output file (#209)" This reverts commit 2d9ff56.
* Revert "Output non host reads and non host contigs for AMR (#205)" This reverts commit 9de3fc2.
* tune hisat2 memory usage (#223)
* Legacy Host Filter initial commit (#224)
* legacy-host-filter-inital-commit
* linting
* add stage io map
* remove stage io map swp file
* Revert "Remove AMR changes that are a WIP from modern host filtering branch (#219)" (#226) This reverts commit 227a489.

---------

Co-authored-by: Mike Lin <mlin@Mikes-MacBook-Pro.local>
Co-authored-by: Omar Valenzuela <ovalenzuela@chanzuckerberg.com>
Co-authored-by: Omar Valenzuela <51972068+ovalenzuela19@users.noreply.github.com>
Co-authored-by: rzlim08 <37033997+rzlim08@users.noreply.github.com>

output non host reads and non contigs

d3ac2b3

valenzuelaomar requested a review from a team April 3, 2023 22:30

valenzuelaomar changed the title ~~[amr-add-outputs] Output non host reads and non host contigs for AMR~~ Output non host reads and non host contigs for AMR Apr 3, 2023

rzlim08 approved these changes Apr 4, 2023

remove non host reads and non host contigs from outputs.zip file

bed9766

add seqfu to inteleave files

f730eea

rzlim08 reviewed Apr 12, 2023

Merge branch 'main' into amr-add-outputs

e7c3d73

valenzuelaomar merged commit 9de3fc2 into main Apr 14, 2023

valenzuelaomar deleted the amr-add-outputs branch April 14, 2023 00:17

valenzuelaomar mentioned this pull request Apr 19, 2023

Remove AMR changes that are a WIP from modern host filtering branch #219

Merged

valenzuelaomar added a commit that referenced this pull request Apr 19, 2023

Revert "Output non host reads and non host contigs for AMR (#205)"

704bca6

This reverts commit 9de3fc2.
