[MRG] remove abundance trimming from the default genome-grist workflow (#199)

* try disabling trim-low-abund
* update output yet again
* remove abundtrim as a step in the default workflow
* adjust docs re abundtrim_reads
* add abundtrim_reads to make test
* fix tests, update Makefile and docs
* update docs
dib-lab · Sep 26, 2022 · 8c94649
1 parent 8cc02ca
commit 8c94649

Showing 16 changed files with 114 additions and 82 deletions.
diff --git a/Makefile b/Makefile
@@ -18,19 +18,19 @@ test:
 	# try various targets to make sure they work
 	genome-grist run tests/test-data/SRR5950647.conf download_genbank_genomes \
 	    combine_genome_info retrieve_genomes estimate_distinct_kmers \
-	    count_trimmed_reads summarize_sample_info -j 8 -p
+	    count_trimmed_reads summarize_sample_info abundtrim_reads -j 8 -p
 
 ### private/local genomes test stuff
 
-test-private: outputs.private/abundtrim/podar.abundtrim.fq.gz \
+test-private: outputs.private/trim/podar.trim.fq.gz \
 		databases/podar-ref.zip  databases/podar-ref.info.csv \
 		databases/podar-ref.tax.csv
 	genome-grist run conf-private.yml summarize_gather summarize_mapping summarize_tax -j 4 -p
 
 # download the (subsampled) reads for SRR606249
-outputs.private/abundtrim/podar.abundtrim.fq.gz:
-	mkdir -p outputs.private/abundtrim
-	curl -L https://osf.io/ckbq3/download -o outputs.private/abundtrim/podar.abundtrim.fq.gz
+outputs.private/trim/podar.trim.fq.gz:
+	mkdir -p outputs.private/trim
+	curl -L https://osf.io/ckbq3/download -o outputs.private/trim/podar.trim.fq.gz
 
 # download the ref genomes
 databases/podar-ref/: 

diff --git a/doc/configuring.md b/doc/configuring.md
@@ -27,27 +27,32 @@ the config file, and genome-grist will automatically download, interleave,
 and trim them for you.
 
 If you want to run genome-grist on your own metagenomes, you need to
-provide one FASTQ file for each sample in the `abundtrim/`
-subdirectory of the output directory; for example, for the output
-directory `outputs.private` and the sample named `podar`, you would
-need to create `outputs.private/abundtrim/podar.abundtrim.fq.gz`.
-This should be an interleaved file of Illumina reads, as generated by
-(for example) `seqtk mergepe`.
+provide one FASTQ file per sample in the `trim/` subdirectory of the
+output directory; for example, for the output directory
+`outputs.private` and the sample named `podar`, you would need to
+create `outputs.private/trim/podar.trim.fq.gz`.  This should be an
+interleaved file of Illumina reads, as generated by (for example)
+`seqtk mergepe`.  The file must start with the sample name and end in
+`trim.fq.gz`.
 
 Providing correctly-named files will shortcut automatic SRA downloads for
 these files, and genome-grist will download any remaining samples.
+
 If you want to prevent automatic downloading from the SRA completely,
 you can set the parameter `prevent_sra_download: true` in the config file.
+This is a good parameter to set if you are only analyzing your own
+prepared data!
 
 ## Using Genbank genomes
 
 For Genbank genomes, all the necessary information is available already, or automatically determined by genome-grist.
 
-sourmash already provides pre-built databases containing [all GTDB genomes (R06 rs202)](https://sourmash.readthedocs.io/en/latest/databases.html) as well as [all 700,000 Genbank microbial genomes from July 2020](https://github.com/sourmash-bio/sourmash/issues/1749#issuecomment-947920226).
+sourmash already provides
+[pre-built databases containing all GTDB genomes (R07 rs207) as well as 1.3m Genbank microbial genomes from 2022](https://sourmash.readthedocs.io/en/latest/databases.html).
 
 For genomes available through Genbank (aka with Genbank accessions), genome-grist does the genome retrieval automatically, so you don't need to have them downloaded already.
 
-Taxonomy spreadsheets are available for GTDB (at the databases page) and for Genbank 700k/July 2020 (link upon request).
+Taxonomy spreadsheets are available for both GTDB and Genbank at the Databases page above.
 
 ## Preparing information on local genomes
 
@@ -130,11 +135,11 @@ If you want to enable taxonomic summarization for your local genomes, you'll nee
 
 ### Testing it all out
 
-We recommend trying this all out with a fake metagenome that's just two of your local genomes concatenated; you can set this up by making the FASTA file and then putting it in your output directory in the subdirectory `abundtrim/{sample}.abundtrim.fq.gz`, and configuring genome-grist to run `summarize_gather` on just that sample.
+We recommend trying this all out with a fake metagenome that's just two of your local genomes concatenated; you can set this up by making the FASTA file and then putting it in your output directory in the subdirectory `trim/{sample}.trim.fq.gz`, and configuring genome-grist to run `summarize_gather` on just that sample.
 
 So, for example, 
 
-* create a file `abundtrim/testme.abundtrim.fq.gz` containing a bunch of sequences (FASTA or FASTQ format, despite the filename :)
+* create a file `trim/testme.trim.fq.gz` containing a bunch of sequences (FASTA or FASTQ format, despite the filename :)
 * set `samples` in your config file `conf-test.yml` to `- testme`
 * run `genome-grist run conf-test.yml summarize_gather`
 
@@ -236,8 +241,8 @@ While you can certainly run this on the entire metagenome from Shakya et al., 20
 
 You can download this subsetted metagenome like so:
 ```shell
-mkdir -p outputs.private/abundtrim
-curl -L https://osf.io/ckbq3/download -o outputs.private/abundtrim/podar.abundtrim.fq.gz
+mkdir -p outputs.private/trim
+curl -L https://osf.io/ckbq3/download -o outputs.private/trim/podar.trim.fq.gz
 ```
 and then confirm that the config file `conf-private.yml` has the following content:
 
@@ -295,7 +300,7 @@ samples:
 # this will be created if it doesn't exist.
 outdir: some_directory
 
-# metagenome_trim_memory: how much memory (RAM) to use when trimming reads with khmer's trim-low-abund.
+# metagenome_trim_memory: how much memory (RAM) to use when trimming reads with khmer's trim-low-abund. @CTB
 # set to 1e9 for very low diversity samples,
 # 10e9 for medium-diversity samples,
 # and 50e9 if you're foolishly working with soil :)
@@ -403,7 +408,7 @@ genome-grist is built on top of [the snakemake workflow](https://snakemake.readt
 For example,
 
 * you can put your own `{sample}_1.fastq.gz`, `{sample}_2.fastq.gz`, and `{sample}_unpaired.fastq.gz` files in `raw/` to have genome-grist process reads for you.
-* you can put your own interleaved reads file in `abundtrim/{sample}.abundtrim.fq.gz` to run genome-grist on an unpublished or already-preprocessed set of reads;
-* you can put your own sourmash signature (k=31, scaled=1000) in `sigs/{sample}.abundtrim.sig` if you want to have it do the database search for you;
+* you can put your own interleaved reads file in `trim/{sample}.trim.fq.gz` to run genome-grist on an unpublished or already-preprocessed set of reads;
+* you can put your own sourmash signature (k=31, scaled=1000) in `sigs/{sample}.trim.sig.zip` if you want to have it do the database search for you;
 
 Please see [the genome-grist Snakefile](https://github.com/dib-lab/genome-grist/blob/latest/genome_grist/conf/Snakefile) for all the gory details.
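
A note on the input-path rename documented in `doc/configuring.md` above. The sketch below is not part of this commit; it only illustrates where a user-supplied reads file is expected after the change, and where the docs placed it before. The helper names `expected_reads_path` and `legacy_reads_path` are hypothetical; only the `trim/{sample}.trim.fq.gz` and `abundtrim/{sample}.abundtrim.fq.gz` patterns come from the docs.

```python
from pathlib import Path


def expected_reads_path(outdir: str, sample: str) -> Path:
    """Where the updated docs say user-supplied reads should go."""
    return Path(outdir) / "trim" / f"{sample}.trim.fq.gz"


def legacy_reads_path(outdir: str, sample: str) -> Path:
    """Where the docs placed user-supplied reads before this change."""
    return Path(outdir) / "abundtrim" / f"{sample}.abundtrim.fq.gz"


if __name__ == "__main__":
    # Example values taken from the docs: outdir 'outputs.private', sample 'podar'.
    print(legacy_reads_path("outputs.private", "podar"))
    print(expected_reads_path("outputs.private", "podar"))
```

Anyone with files staged under the old layout would presumably need to move or copy them to the new location; whether a plain rename is appropriate depends on how those reads were prepared.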
diff --git a/doc/index.md b/doc/index.md
@@ -14,11 +14,13 @@ You can use genome-grist to:
 
 * find out what genomes are in a metagenome!
 
-* map reads to those genomes!
+* estimate how much of the metagenome will map to reference genomes!
+
+* map reads to each genome and summarize the results across genomes!
 
 * summarize the taxonomic composition of a metagenome!
 
-genome-grist automates download of public data, and will automatically
+genome-grist automates the analysis of public data, and will automatically
 access metagenomes from the SRA and genomes from Genbank.
 genome-grist supports both the NCBI and GTDB taxonomies. You can also
 use your own metagenomes and genomes.
@@ -89,9 +91,9 @@ For now, Irber et al., 2022 is the primary citation for genome-grist. Any use of
 
 ### Resource requirements
 
-**Disk space:** genome-grist makes about 4-5 copies of each SRA metagenome.
+**Disk space:** genome-grist makes about 3-4 copies of each SRA metagenome analyzed.
 
-**Memory:** the genbank search step on all of genbank takes ~120 GB of RAM. On GTDB, it's much, much less. Other than that, the other steps are all under 10 GB of RAM (unless you adjust `metagenome_trim_memory` upwards, which may be needed for complex metagenomes).
+**Memory:** the genbank search step on all of genbank takes ~120 GB of RAM. On GTDB, it's much, much less. Other than that, the other steps are all under 10 GB of RAM.
 
 **Time:** This is largely dependent on the size of the metagenome; 100m reads takes a few hours. The processing of multiple data sets can be done in parallel with `-j`, as well, although you probably want to specify resource limits. For example, here is the command that we use on our HPC:
 ```
@@ -143,7 +145,7 @@ and [matplotlib](https://matplotlib.org/).
 The default search databases used for genome-grist are based on
 sequences from [Genbank](https://www.ncbi.nlm.nih.gov/genbank/) and
 taxonomies from Genbank and [GTDB](https://gtdb.ecogenomic.org/). The
-database are provided by
+databases are provided by
 [the sourmash project](https://sourmash.readthedocs.io/en/latest/databases.html).
 
 We develop genome-grist at

diff --git a/doc/output-guide.md b/doc/output-guide.md
@@ -36,12 +36,12 @@ Below, we use "sample" and "metagenome" interchangeably.
 ### Metagenome reads
 
 * `{outdir}/raw/` - untrimmed reads, from the SRA or private sequencing.
-* `{outdir}/trim/` - adapter and quality-trimmed reads, starting from `raw/`.
-* `{outdir}/abundtrim/` - inputs into downstream steps.
+* `{outdir}/trim/` - adapter and quality-trimmed reads, starting from `raw/`; inputs into downstream steps.
+* `{outdir}/abundtrim/` - optional output of variable-coverage k-mer trimming; not used in genome-grist.
 
 ### sourmash output
 
-* `{outdir}/sigs/` - sourmash sketches calculated from abundtrim reads.
+* `{outdir}/sigs/` - sourmash sketches calculated from trimmed reads in `{outdir}/trim/`.
 * `{outdir}/gather/` - sourmash outputs; see details below.
 
 ### Genomes and mapping information

diff --git a/doc/quickstart.md b/doc/quickstart.md
@@ -14,13 +14,20 @@ conda activate grist
 python -m pip install genome-grist
 ```
 
-Note: genome-grist should run in Python 3.7 onwards; we haven't tested it extensively in Python 3.10 yet (as of Jan 2022).
+Note: genome-grist should run in Python 3.8 onwards (as of Sep 2022).
 
 ## Running genome-grist
 
-We currently recommend running genome-grist in its own directory, for several reasons; in particular, genome-grist uses snakemake and conda to install software under the working directory, and it's nice to have all the outputs be isolated.
+We currently recommend running genome-grist in its own directory, for
+several reasons; in particular, genome-grist uses snakemake and conda
+to install software under the working directory, and it's nice to have
+all of the files be shared.
 
-Within the current working directory, genome-grist will create a `genbank_cache/` subdir, and any `outputs.NAME` subdirectories requested by the configuration.  We recommend always running genome-grist in this directory and naming the output directories after the different projects using genome-grist.
+Within the current working directory, genome-grist will create a
+`genbank_cache/` subdir, and any `outputs.NAME/` subdirectories
+requested by the configuration.  We recommend always running
+genome-grist in this directory and naming the output directories after
+the different projects using genome-grist.
 
 So, create a subdirectory and change into it:
 ```shell
@@ -31,11 +38,15 @@ Note, genome-grist works entirely within the current working directory and temp
 
 ### Download a small example database
 
-Download the GTDB r06 rs202 set of ~48,000 guide genomes, in a pre-prepared sourmash database format:
+Download the GTDB r06 rs202 set of ~48,000 guide genomes, in a
+pre-prepared sourmash database format:
 ```
 curl -L https://osf.io/w4bcm/download -o gtdb-rs202.genomic-reps.k31.sbt.zip
 ```
-(You can use any sourmash database that uses Genbank identifiers here; see [available databases](https://sourmash.readthedocs.io/en/latest/databases.html) for more info.)
+You can use any sourmash database with Genbank identifiers; see
+[available databases](https://sourmash.readthedocs.io/en/latest/databases.html)
+for more info. You can also use private databases; see the
+configuration docs for more info.
 
 ### Make a configuration file
 
@@ -44,7 +55,6 @@ Put the following in a config file named `conf-tutorial.yml`:
 samples:
 - SRR5950647
 outdir: outputs.tutorial/
-metagenome_trim_memory: 1e9
 
 sourmash_databases:
 - gtdb-rs202.genomic-reps.k31.sbt.zip
@@ -61,7 +71,7 @@ genome-grist run conf-tutorial.yml summarize_gather summarize_mapping
 This will perform the following steps:
 
 * download the [SRR5950647 metagenome](https://www.ncbi.nlm.nih.gov/sra/?term=SRR5950647) from the Sequence Read Archive (target `download_reads`).
-* preprocess it to remove adapters and low-abundance k-mers (target `trim_reads`).
+* preprocess it to remove adapters and low-quality reads (target `trim_reads`).
 * build a sourmash signature from the preprocessed reads (target `smash_reads`).
 * perform a `sourmash gather` against the specified database (target `gather_reads`).
 * download the matching genomes from GenBank into `genbank_cache/` (target `download_matching_genomes`).
@@ -72,14 +82,17 @@ You can put one or more targets on the command line as above with `summarize_gat
 
 ## Output files
 
-The key output files under the outputs directory are:
+Some key output files under the outputs directory are:
 
-* `gather/{sample}.x.genbank.gather.out` - human-readable output from [sourmash gather](https://sourmash.readthedocs.io/en/latest/classifying-signatures.html).
-* `gather/{sample}.x.genbank.gather.csv` - [sourmash gather CSV output](https://sourmash.readthedocs.io/en/latest/classifying-signatures.html).
+* `gather/{sample}.gather.out` - human-readable output from [sourmash gather](https://sourmash.readthedocs.io/en/latest/classifying-signatures.html).
+* `gather/{sample}.gather.csv` - [sourmash gather CSV output](https://sourmash.readthedocs.io/en/latest/classifying-signatures.html).
+* `gather/genomes/` - all of the genomes found across all of the samples.
 * `gather/{sample}.genomes.info.csv` - information about the matching genomes from genbank.
+* `mapping/{sample}.summary.csv` - summary information about mapped reads.
 * `reports/report-{sample}.html` - a summary report.
-* `abundtrim/{sample}.abundtrim.fq.gz` - trimmed and preprocessed reads.
-* `sigs/{sample}.abundtrim.sig` - sourmash signature for the preprocessed reads.
+* `trim/{sample}.trim.fq.gz` - trimmed and preprocessed reads.
+* `sigs/{sample}.trim.sig.zip` - sourmash signature for the preprocessed reads.
 
 Note that `genome-grist run <config.yml> zip` will create a file named `transfer.zip` with the above files in it.
 
+Please see [the guide to genome-grist output files](output-guide.md) for more information!
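
A note on the updated "Output files" list in `doc/quickstart.md` above. As a rough sketch (not part of this commit), the per-sample file patterns from that list can be expanded for a given sample and checked for existence after a run. The function names `key_outputs` and `missing_outputs` are hypothetical, and the pattern list only covers the per-sample files named in the quickstart, not the `gather/genomes/` directory.

```python
from pathlib import Path
from typing import List

# Per-sample patterns copied from the updated quickstart list.
KEY_OUTPUT_PATTERNS = [
    "gather/{sample}.gather.out",
    "gather/{sample}.gather.csv",
    "gather/{sample}.genomes.info.csv",
    "mapping/{sample}.summary.csv",
    "reports/report-{sample}.html",
    "trim/{sample}.trim.fq.gz",
    "sigs/{sample}.trim.sig.zip",
]


def key_outputs(outdir: str, sample: str) -> List[Path]:
    """Expand each pattern for one sample under the output directory."""
    return [Path(outdir) / p.format(sample=sample) for p in KEY_OUTPUT_PATTERNS]


def missing_outputs(outdir: str, sample: str) -> List[Path]:
    """Return the expected files that do not exist on disk."""
    return [p for p in key_outputs(outdir, sample) if not p.exists()]


if __name__ == "__main__":
    # Sample and outdir taken from the tutorial config in the quickstart.
    for path in missing_outputs("outputs.tutorial", "SRR5950647"):
        print("missing:", path)
```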
diff --git a/genome_grist/__main__.py b/genome_grist/__main__.py
@@ -139,6 +139,7 @@ def run(configfile, snakemake_args, no_use_conda, verbose, outdir, help):
  * gather_reads - run 'sourmash gather' on metagenomes against Genbank
  * download_genbank_genomes - download all matching Genbank genomes
  * map_reads - map all metagenome reads to Genbank genomes
+ * abundtrim_reads - do variable-coverage trimming on data sets; see docs
  * make_sgc_conf - make a spacegraphcats config file
 
 Please see https://github.com/dib-lab/genome-grist for quickstart docs.
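
A note on the new `abundtrim_reads` target listed above. Since abundance trimming is no longer part of the default workflow, it has to be requested as an explicit target, as the updated `make test` rule does. The sketch below (not part of this commit) builds such a command line without running it; `grist_command` is a hypothetical helper, and the config file name is the one from the quickstart.

```python
import shlex
from typing import List


def grist_command(configfile: str, targets: List[str], jobs: int = 1) -> List[str]:
    """Build (but do not run) a 'genome-grist run' command line."""
    # '-j' is the parallelism flag used in the Makefile and docs.
    return ["genome-grist", "run", configfile, *targets, "-j", str(jobs)]


if __name__ == "__main__":
    cmd = grist_command("conf-tutorial.yml", ["abundtrim_reads"], jobs=4)
    print(shlex.join(cmd))
```

Passing the resulting list to `subprocess.run(cmd, check=True)` would be one way to execute it, assuming genome-grist is installed in the active environment.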