Unify host genome generation #182

morsecodist · 2023-02-01T21:28:55Z

No description provided.

mlin

Great stuff, thanks @morsecodist!

rzlim08

Looks good, just a couple of nits.

rzlim08 · 2023-02-24T19:18:12Z

workflows/host-genome-generation/host_genome_generation.wdl

+  runtime {
+    docker: docker_image_id
+    cpu: cpu
+    memory: "240G"


Do we want this to be either GiB or GB?
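To make the difference concrete, here is a small sketch of what the two readings of "240G" come to in bytes. The helper names `gb_to_bytes` and `gib_to_bytes` are made up for illustration; which unit the WDL runner actually applies to a bare "G" suffix is not something this snippet settles.

```shell
# Convert a size in decimal gigabytes (GB, 10^9 bytes) to bytes.
gb_to_bytes() { echo $(( $1 * 1000 * 1000 * 1000 )); }

# Convert a size in binary gibibytes (GiB, 2^30 bytes) to bytes.
gib_to_bytes() { echo $(( $1 * 1024 * 1024 * 1024 )); }

echo "240 GB  = $(gb_to_bytes 240) bytes"
echo "240 GiB = $(gib_to_bytes 240) bytes"
```

The two values are 240000000000 and 257698037760 bytes, a gap of roughly 17.7 GB, which can matter when picking an instance type close to the limit.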

rzlim08 · 2023-02-24T19:19:35Z

workflows/host-genome-generation/host_genome_generation.wdl

+    String docker_image_id
+  }
+
+  call ensure_gz as genome_fasta {


Are we using the outputs of this call anywhere?

rzlim08 · 2023-03-14T21:44:36Z

workflows/host-genome-generation/host_genome_generation.wdl

+    TMPDIR=${TMPDIR:-/tmp}
+
+    if [ "~{nucleotide_type}" == "dna" ]; then
+        >&2 minimap2 -x map-ont -d '~{genome_name}_~{nucleotide_type}.mmi' "~{fasta}"


Not sure why, but `nucleotide_type` is not getting filled in with `dna` or `rna`, whereas `genome_name` is.
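While the cause is still unknown, one way to surface the problem early would be a guard that refuses to build the index name unless the value is exactly `dna` or `rna`. This is only a debugging sketch: `mmi_name` is a hypothetical helper, not something in the workflow, and it does not explain why the interpolation comes out empty.

```shell
# Hypothetical helper: print the minimap2 index filename, rejecting any
# nucleotide type other than "dna" or "rna" (including an empty string).
mmi_name() {
  genome_name=$1
  nucleotide_type=$2
  case "$nucleotide_type" in
    dna|rna)
      echo "${genome_name}_${nucleotide_type}.mmi"
      ;;
    *)
      echo "unexpected nucleotide_type: '$nucleotide_type'" >&2
      return 1
      ;;
  esac
}

mmi_name human dna
mmi_name human "" 2>/dev/null || echo "rejected empty nucleotide_type"
```

With a check like this at the top of the command block, a missing value would fail the task immediately rather than produce an oddly named `.mmi` file.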

rzlim08 · 2023-03-27T23:08:56Z

workflows/host-genome-generation/host_genome_generation.wdl

+      /hisat2/hisat2_extract_splice_sites.py <(pigz -dc '~{transcripts_gtf_gz}') > "$TMPDIR/genome.ss" & pid=$!
+      /hisat2/hisat2_extract_exons.py <(pigz -dc '~{transcripts_gtf_gz}') > "$TMPDIR/genome.exon"
+      wait $pid
+      >&2 /hisat2/hisat2-build -p 16 \


@mlin @morsecodist I'm having trouble with this build when the only `transcripts_gtf_gz` is the ERCC.gtf. The `hisat2-build` command fails with `Error: Encountered exception: 'Nongraph exception'`.

The ERCC.gtf is needed for STAR to properly run.

Can we say that if the file `$TMPDIR/genome.ss` is empty, we don't bother running with `--exon "$TMPDIR/genome.exon" --ss "$TMPDIR/genome.ss"`?
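A rough sketch of that idea, assuming the failure comes from `genome.ss` being empty for an ERCC-only GTF (an assumption, not something confirmed in this thread). `hisat2_build_cmd` is a hypothetical helper that only prints the command line it would run, so the snippet works without HISAT2 installed; the `/hisat2/hisat2-build -p 16` prefix is taken from the diff above.

```shell
# Hypothetical helper: print the hisat2-build command line that would be run.
# --exon and --ss are added only when the splice-site file is non-empty
# ([ -s FILE ] is true when FILE exists and has a size greater than zero).
hisat2_build_cmd() {
  fasta=$1
  prefix=$2
  ss_file=$3
  exon_file=$4
  cmd="/hisat2/hisat2-build -p 16"
  if [ -s "$ss_file" ]; then
    cmd="$cmd --exon $exon_file --ss $ss_file"
  fi
  # Printed for display only; a real invocation should quote each path.
  echo "$cmd $fasta $prefix"
}

# Demo: an empty genome.ss, as assumed for the ERCC-only case.
demo_dir=$(mktemp -d)
: > "$demo_dir/genome.ss"
printf 'ERCC-00002\t0\t1060\t+\n' > "$demo_dir/genome.exon"
hisat2_build_cmd genome.fa genome "$demo_dir/genome.ss" "$demo_dir/genome.exon"
rm -rf "$demo_dir"
```

In the demo the printed command has neither flag, so the index would be built as a plain, non-graph index in that case.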

rzlim08 · 2023-04-24T22:21:45Z

workflows/host-genome-generation/host_genome_generation.wdl

+    set -euxo pipefail
+    TMPDIR=${TMPDIR:-/tmp}
+
+    mkdir -p "$TMPDIR"'/bt2/~{genome_name}'


There's a discrepancy here with the old host filtering, where the folder is named

"~{genome_name}.bowtie2"

This makes the two inconsistent, so the new bowtie2 indices are not backward compatible. We could create a symlink here before tarring to make them backward compatible. Otherwise we'd have to regenerate either the modern or the old host filtering indices.
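A minimal sketch of the symlink option, assuming the legacy name should sit next to the new directory inside `bt2/`. The helper name `link_legacy_bowtie2_dir` and the exact layout are assumptions for illustration.

```shell
# Hypothetical helper: inside index_root, create <genome_name>.bowtie2 as a
# relative symlink pointing at the <genome_name> directory.
link_legacy_bowtie2_dir() {
  index_root=$1
  genome_name=$2
  # The link target is resolved relative to the link's own directory, so it
  # keeps working after the tree is archived and unpacked somewhere else.
  ln -s "$genome_name" "$index_root/$genome_name.bowtie2"
}

# Demo in a scratch directory.
demo_root=$(mktemp -d)
mkdir -p "$demo_root/bt2/human"
link_legacy_bowtie2_dir "$demo_root/bt2" human
readlink "$demo_root/bt2/human.bowtie2"
rm -rf "$demo_root"
```

An absolute target under `$TMPDIR` would dangle once the tarball is extracted elsewhere, which is why the sketch keeps the target relative.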

* fastp
* fastp single
* bowtie2 run
* hisat2 run
* dedup run
* run subsample
* run kallisto
* adjust index tar filenames
* polishing
* polishing
* count reads in each step
* Create host_filter_indexing.wdl
* boost fastp complexity threshold
* output fastp report
* build fastp from our fork with SDUST complexity filtering
* use fastp --sdust_complexity_filter
* bump
* bump
* tune
* stub the remaining step descriptions
* wire to tests
* and auto_benchmark
* fixup tests
* fixup tests
* fixup tests
* fixup tests
* fixup tests
* fixup tests
* add back in picard CollectInsertSizeMetrics
* picard step description
* host_filter_2022.wdl => host_filter.wdl
* polish
* restore fastqs_0 and fastqs_1 to minimize collateral changes
* add minimap2 index build
* picard_insert_metrics.txt
* amr/run.wdl workaround
* index multiple transcripts_fasta_gz
* make gtf optional
* allow uncompressed genome fasta
* allow uncompressed genome fasta
* allow uncompressed genome fasta
* bump minimap2 memory
* bump minimap2 memory
* step descriptions -- first draft
* add indexing driver & draft readme
* include invocations in step descriptions
* rebase amr fix
* load card_json
* run kallisto every time
* fix amr wdl
* fix short-read-mngs rebase weirdness
* add final things
* [modernized host filter] add ERCC and gene-level outputs to kallisto (#175)

  The kallisto step gains two new derivative output files:

  * `ERCC_counts.tsv`: Estimated read counts for the ERCC sequences only (two-column TSV: ERCC_id, est_counts)
  * `gene_abundance.tsv`: gene-level est_counts and tpm, computed by summing over all transcripts for each gene
  * (and `abundance.tsv` is renamed to `transcript_abundance.tsv`)

  To get the `gene_abundance.tsv` we need a new input `gtf_gz`, the Ensembl GTF file for the host species that will tell it how to map the transcript IDs in `transcript_abundance.tsv` onto gene IDs for the roll-up. The input is optional and if absent then the `gene_abundance.tsv` output is omitted too.

  Note: docker image update needed to install & upgrade some dependencies.

* load card_json explicitly
* add ~
* fix host_filter unit tests
* fix host_filter unit tests
* bowtie2: sort by read name for better reproducibility
* update minimap2 indexing invocation
* add chelonia_mydas, drosophila_melanogaster, gray_whale, pea-aphid
* copy-paste {bowtie2,hisat2}_human_filter to support pipeline viz
* allow kallisto nonzero exit
* rename modern host filtering inputs/outputs and create a 1-1 mapping between inputs/outputs
* fix lint issue
* rename reads_in_count to input_read_count
* auto_benchmark updates
* fix test_RunCZIDDedup_safe_csv
* rename kallisto output files
* update mosquitos with several Culicidae
* add files to wdl output for pipeline viz compatibility
* convert headers in descriptions to bolded text
* delete host_filter_indexing since it's subsumed in #182
* fix glob patterns in read counting
* Revert "fix glob patterns in read counting"

  This reverts commit aeb234f.

* [Bug] fix count expansion for single file short-read-mngs (#216)

  * fix bowtie2 counts for single file
  * fix extra expansions
  * relieve hisat2 dependency
  * single sample hisat2
  * fix hisat2
  * fix dockerfile for hisat2

  ---------

  Co-authored-by: Omar Valenzuela <51972068+ovalenzuela19@users.noreply.github.com>

* Remove AMR changes that are a WIP from modern host filtering branch (#219)

  * Revert "output gene id in primary output file (#209)"

    This reverts commit 2d9ff56.

  * Revert "Output non host reads and non host contigs for AMR (#205)"

    This reverts commit 9de3fc2.

* tune hisat2 memory usage (#223)
* Legacy Host Filter initial commit (#224)

  * legacy-host-filter-inital-commit
  * linting
  * add stage io map
  * remove stage io map swp file

* Revert "Remove AMR changes that are a WIP from modern host filtering branch (#219)" (#226)

  This reverts commit 227a489.

---------

Co-authored-by: Mike Lin <mlin@Mikes-MacBook-Pro.local>
Co-authored-by: Omar Valenzuela <ovalenzuela@chanzuckerberg.com>
Co-authored-by: Omar Valenzuela <51972068+ovalenzuela19@users.noreply.github.com>
Co-authored-by: rzlim08 <37033997+rzlim08@users.noreply.github.com>

morsecodist added 10 commits February 1, 2023 13:28

Unify host genome generation

3eaaf70

add concat unzip step + docker_image_id

86a0a38

update docker

03eade9

install docker dependencies

558ff43

gzip inputs

4556f51

fix file naming

29a721e

python path docker

9921360

star gtf flag fix

315c392

syntax

0639371

fix name outputs

54d98fe

morsecodist requested review from mlin and a team February 6, 2023 18:26

mlin approved these changes Feb 9, 2023

rzlim08 approved these changes Feb 24, 2023

rzlim08 reviewed Mar 14, 2023

rzlim08 reviewed Mar 27, 2023

mlin added a commit that referenced this pull request Mar 31, 2023

delete host_filter_indexing since it's subsumed in #182

e35cd04

mlin mentioned this pull request Mar 31, 2023

host_filter.wdl modernization #70

Merged

fix ercc gtf in host-genome generation (#202)

0d92776

* fix ercc gtf in index generation
* pigz to cat

valenzuelaomar mentioned this pull request Apr 19, 2023

Remove AMR changes that are a WIP from modern host filtering branch #219

Merged

rzlim08 reviewed Apr 24, 2023

rzlim08 added 2 commits April 26, 2023 13:38

symlink bowtie2 directory (#229)

547a453

create a relative symlink (#250)

a9a57a4
