[MRG] remove abundance trimming from the default genome-grist workflow (#199)

* try disabling trim-low-abund

* update output yet again

* remove abundtrim as a step in the default workflow

* adjust docs re abundtrim_reads

* add abundtrim_reads to make test

* fix tests, update Makefile and docs

* update docs
ctb authored Sep 26, 2022
1 parent 8cc02ca commit 8c94649
Showing 16 changed files with 114 additions and 82 deletions.
10 changes: 5 additions & 5 deletions Makefile
@@ -18,19 +18,19 @@ test:
# try various targets to make sure they work
genome-grist run tests/test-data/SRR5950647.conf download_genbank_genomes \
combine_genome_info retrieve_genomes estimate_distinct_kmers \
count_trimmed_reads summarize_sample_info -j 8 -p
count_trimmed_reads summarize_sample_info abundtrim_reads -j 8 -p

### private/local genomes test stuff

test-private: outputs.private/abundtrim/podar.abundtrim.fq.gz \
test-private: outputs.private/trim/podar.trim.fq.gz \
databases/podar-ref.zip databases/podar-ref.info.csv \
databases/podar-ref.tax.csv
genome-grist run conf-private.yml summarize_gather summarize_mapping summarize_tax -j 4 -p

# download the (subsampled) reads for SRR606249
outputs.private/abundtrim/podar.abundtrim.fq.gz:
mkdir -p outputs.private/abundtrim
curl -L https://osf.io/ckbq3/download -o outputs.private/abundtrim/podar.abundtrim.fq.gz
outputs.private/trim/podar.trim.fq.gz:
mkdir -p outputs.private/trim
curl -L https://osf.io/ckbq3/download -o outputs.private/trim/podar.trim.fq.gz

# download the ref genomes
databases/podar-ref/:
35 changes: 20 additions & 15 deletions doc/configuring.md
@@ -27,27 +27,32 @@ the config file, and genome-grist will automatically download, interleave,
and trim them for you.

If you want to run genome-grist on your own metagenomes, you need to
provide one FASTQ file for each sample in the `abundtrim/`
subdirectory of the output directory; for example, for the output
directory `outputs.private` and the sample named `podar`, you would
need to create `outputs.private/abundtrim/podar.abundtrim.fq.gz`.
This should be an interleaved file of Illumina reads, as generated by
(for example) `seqtk mergepe`.
provide one FASTQ file per sample in the `trim/` subdirectory of the
output directory; for example, for the output directory
`outputs.private` and the sample named `podar`, you would need to
create `outputs.private/trim/podar.trim.fq.gz`. This should be an
interleaved file of Illumina reads, as generated by (for example)
`seqtk mergepe`. The file must start with the sample name and end in
`trim.fq.gz`.

Providing correctly-named files will shortcut automatic SRA downloads for
these files, and genome-grist will download any remaining samples.

If you want to prevent automatic downloading from the SRA completely,
you can set the parameter `prevent_sra_download: true` in the config file.
This is a good parameter to set if you are only analyzing your own
prepared data!

## Using Genbank genomes

For Genbank genomes, all the necessary information is available already, or automatically determined by genome-grist.

sourmash already provides pre-built databases containing [all GTDB genomes (R06 rs202)](https://sourmash.readthedocs.io/en/latest/databases.html) as well as [all 700,000 Genbank microbial genomes from July 2020](https://github.com/sourmash-bio/sourmash/issues/1749#issuecomment-947920226).
sourmash already provides
[pre-built databases containing all GTDB genomes (R07 rs207) as well as 1.3m Genbank microbial genomes from 2022](https://sourmash.readthedocs.io/en/latest/databases.html).

For genomes available through Genbank (aka with Genbank accessions), genome-grist does the genome retrieval automatically, so you don't need to have them downloaded already.

Taxonomy spreadsheets are available for GTDB (at the databases page) and for Genbank 700k/July 2020 (link upon request).
Taxonomy spreadsheets are available for both GTDB and Genbank at the Databases page above.

## Preparing information on local genomes

@@ -130,11 +135,11 @@ If you want to enable taxonomic summarization for your local genomes, you'll nee

### Testing it all out

We recommend trying this all out with a fake metagenome that's just two of your local genomes concatenated; you can set this up by making the FASTA file and then putting it in your output directory in the subdirectory `abundtrim/{sample}.abundtrim.fq.gz`, and configuring genome-grist to run `summarize_gather` on just that sample.
We recommend trying this all out with a fake metagenome that's just two of your local genomes concatenated; you can set this up by making the FASTA file and then putting it in your output directory in the subdirectory `trim/{sample}.trim.fq.gz`, and configuring genome-grist to run `summarize_gather` on just that sample.

So, for example,

* create a file `abundtrim/testme.abundtrim.fq.gz` containing a bunch of sequences (FASTA or FASTQ format, despite the filename :)
* create a file `trim/testme.trim.fq.gz` containing a bunch of sequences (FASTA or FASTQ format, despite the filename :)
* set `samples` in your config file `conf-test.yml` to `- testme`
* run `genome-grist run conf-test.yml summarize_gather`

@@ -236,8 +241,8 @@ While you can certainly run this on the entire metagenome from Shakya et al., 20

You can download this subsetted metagenome like so:
```shell
mkdir -p outputs.private/abundtrim
curl -L https://osf.io/ckbq3/download -o outputs.private/abundtrim/podar.abundtrim.fq.gz
mkdir -p outputs.private/trim
curl -L https://osf.io/ckbq3/download -o outputs.private/trim/podar.trim.fq.gz
```
and then confirm that the config file `conf-private.yml` has the following content:

@@ -295,7 +300,7 @@ samples:
# this will be created if it doesn't exist.
outdir: some_directory

# metagenome_trim_memory: how much memory (RAM) to use when trimming reads with khmer's trim-low-abund.
# metagenome_trim_memory: how much memory (RAM) to use when trimming reads with khmer's trim-low-abund. @CTB
# set to 1e9 for very low diversity samples,
# 10e9 for medium-diversity samples,
# and 50e9 if you're foolishly working with soil :)
@@ -403,7 +408,7 @@ genome-grist is built on top of [the snakemake workflow](https://snakemake.readt
For example,
* you can put your own `{sample}_1.fastq.gz`, `{sample}_2.fastq.gz`, and `{sample}_unpaired.fastq.gz` files in `raw/` to have genome-grist process reads for you.
* you can put your own interleaved reads file in `abundtrim/{sample}.abundtrim.fq.gz` to run genome-grist on an unpublished or already-preprocessed set of reads;
* you can put your own sourmash signature (k=31, scaled=1000) in `sigs/{sample}.abundtrim.sig` if you want to have it do the database search for you;
* you can put your own interleaved reads file in `trim/{sample}.trim.fq.gz` to run genome-grist on an unpublished or already-preprocessed set of reads;
* you can put your own sourmash signature (k=31, scaled=1000) in `sigs/{sample}.trim.sig.zip` if you want to have it do the database search for you;

Please see [the genome-grist Snakefile](https://github.com/dib-lab/genome-grist/blob/latest/genome_grist/conf/Snakefile) for all the gory details.
12 changes: 7 additions & 5 deletions doc/index.md
@@ -14,11 +14,13 @@ You can use genome-grist to:

* find out what genomes are in a metagenome!

* map reads to those genomes!
* estimate how much of the metagenome will map to reference genomes!

* map reads to each genome and summarize the results across genomes!

* summarize the taxonomic composition of a metagenome!

genome-grist automates download of public data, and will automatically
genome-grist automates the analysis of public data, and will automatically
access metagenomes from the SRA and genomes from Genbank.
genome-grist supports both the NCBI and GTDB taxonomies. You can also
use your own metagenomes and genomes.
@@ -89,9 +91,9 @@ For now, Irber et al., 2022 is the primary citation for genome-grist. Any use of
### Resource requirements

**Disk space:** genome-grist makes about 4-5 copies of each SRA metagenome.
**Disk space:** genome-grist makes about 3-4 copies of each SRA metagenome analyzed.

**Memory:** the genbank search step on all of genbank takes ~120 GB of RAM. On GTDB, it's much, much less. Other than that, the other steps are all under 10 GB of RAM (unless you adjust `metagenome_trim_memory` upwards, which may be needed for complex metagenomes).
**Memory:** the genbank search step on all of genbank takes ~120 GB of RAM. On GTDB, it's much, much less. Other than that, the other steps are all under 10 GB of RAM.

**Time:** This is largely dependent on the size of the metagenome; 100m reads takes a few hours. The processing of multiple data sets can be done in parallel with `-j`, as well, although you probably want to specify resource limits. For example, here is the command that we use on our HPC:
```
@@ -143,7 +145,7 @@ and [matplotlib](https://matplotlib.org/).
The default search databases used for genome-grist are based on
sequences from [Genbank](https://www.ncbi.nlm.nih.gov/genbank/) and
taxonomies from Genbank and [GTDB](https://gtdb.ecogenomic.org/). The
database are provided by
databases are provided by
[the sourmash project](https://sourmash.readthedocs.io/en/latest/databases.html).

We develop genome-grist at
6 changes: 3 additions & 3 deletions doc/output-guide.md
@@ -36,12 +36,12 @@ Below, we use "sample" and "metagenome" interchangeably.
### Metagenome reads

* `{outdir}/raw/` - untrimmed reads, from the SRA or private sequencing.
* `{outdir}/trim/` - adapter and quality-trimmed reads, starting from `raw/`.
* `{outdir}/abundtrim/` - inputs into downstream steps.
* `{outdir}/trim/` - adapter and quality-trimmed reads, starting from `raw/`; inputs into downstream steps.
* `{outdir}/abundtrim/` - optional output of variable-coverage k-mer trimming; not used in genome-grist.

### sourmash output

* `{outdir}/sigs/` - sourmash sketches calculated from abundtrim reads.
* `{outdir}/sigs/` - sourmash sketches calculated from trimmed reads in `{outdir}/trim/`.
* `{outdir}/gather/` - sourmash outputs; see details below.

### Genomes and mapping information
37 changes: 25 additions & 12 deletions doc/quickstart.md
@@ -14,13 +14,20 @@ conda activate grist
python -m pip install genome-grist
```

Note: genome-grist should run in Python 3.7 onwards; we haven't tested it extensively in Python 3.10 yet (as of Jan 2022).
Note: genome-grist should run in Python 3.8 onwards (as of Sep 2022).

## Running genome-grist

We currently recommend running genome-grist in its own directory, for several reasons; in particular, genome-grist uses snakemake and conda to install software under the working directory, and it's nice to have all the outputs be isolated.
We currently recommend running genome-grist in its own directory, for
several reasons; in particular, genome-grist uses snakemake and conda
to install software under the working directory, and it's nice to have
all of the files be shared.

Within the current working directory, genome-grist will create a `genbank_cache/` subdir, and any `outputs.NAME` subdirectories requested by the configuration. We recommend always running genome-grist in this directory and naming the output directories after the different projects using genome-grist.
Within the current working directory, genome-grist will create a
`genbank_cache/` subdir, and any `outputs.NAME/` subdirectories
requested by the configuration. We recommend always running
genome-grist in this directory and naming the output directories after
the different projects using genome-grist.

So, create a subdirectory and change into it:
```shell
@@ -31,11 +38,15 @@ Note, genome-grist works entirely within the current working directory and temp

### Download a small example database

Download the GTDB r06 rs202 set of ~48,000 guide genomes, in a pre-prepared sourmash database format:
Download the GTDB r06 rs202 set of ~48,000 guide genomes, in a
pre-prepared sourmash database format:
```
curl -L https://osf.io/w4bcm/download -o gtdb-rs202.genomic-reps.k31.sbt.zip
```
(You can use any sourmash database that uses Genbank identifiers here; see [available databases](https://sourmash.readthedocs.io/en/latest/databases.html) for more info.)
You can use any sourmash database with Genbank identifiers; see
[available databases](https://sourmash.readthedocs.io/en/latest/databases.html)
for more info. You can also use private databases; see the
configuration docs for more info.

### Make a configuration file

@@ -44,7 +55,6 @@ Put the following in a config file named `conf-tutorial.yml`:
samples:
- SRR5950647
outdir: outputs.tutorial/
metagenome_trim_memory: 1e9

sourmash_databases:
- gtdb-rs202.genomic-reps.k31.sbt.zip
@@ -61,7 +71,7 @@ genome-grist run conf-tutorial.yml summarize_gather summarize_mapping
This will perform the following steps:

* download the [SRR5950647 metagenome](https://www.ncbi.nlm.nih.gov/sra/?term=SRR5950647) from the Sequence Read Archive (target `download_reads`).
* preprocess it to remove adapters and low-abundance k-mers (target `trim_reads`).
* preprocess it to remove adapters and low-quality reads (target `trim_reads`).
* build a sourmash signature from the preprocessed reads (target `smash_reads`).
* perform a `sourmash gather` against the specified database (target `gather_reads`).
* download the matching genomes from GenBank into `genbank_cache/` (target `download_matching_genomes`).
@@ -72,14 +82,17 @@ You can put one or more targets on the command line as above with `summarize_gat

## Output files

The key output files under the outputs directory are:
Some key output files under the outputs directory are:

* `gather/{sample}.x.genbank.gather.out` - human-readable output from [sourmash gather](https://sourmash.readthedocs.io/en/latest/classifying-signatures.html).
* `gather/{sample}.x.genbank.gather.csv` - [sourmash gather CSV output](https://sourmash.readthedocs.io/en/latest/classifying-signatures.html).
* `gather/{sample}.gather.out` - human-readable output from [sourmash gather](https://sourmash.readthedocs.io/en/latest/classifying-signatures.html).
* `gather/{sample}.gather.csv` - [sourmash gather CSV output](https://sourmash.readthedocs.io/en/latest/classifying-signatures.html).
* `gather/genomes/` - all of the genomes found across all of the samples.
* `gather/{sample}.genomes.info.csv` - information about the matching genomes from genbank.
* `mapping/{sample}.summary.csv` - summary information about mapped reads
* `reports/report-{sample}.html` - a summary report.
* `abundtrim/{sample}.abundtrim.fq.gz` - trimmed and preprocessed reads.
* `sigs/{sample}.abundtrim.sig` - sourmash signature for the preprocessed reads.
* `trim/{sample}.trim.fq.gz` - trimmed and preprocessed reads.
* `sigs/{sample}.trim.sig.zip` - sourmash signature for the preprocessed reads.

Note that `genome-grist run <config.yml> zip` will create a file named `transfer.zip` with the above files in it.

Please see [the guide to genome-grist output files](output-guide.md) for more information!
1 change: 1 addition & 0 deletions genome_grist/__main__.py
@@ -139,6 +139,7 @@ def run(configfile, snakemake_args, no_use_conda, verbose, outdir, help):
* gather_reads - run 'sourmash gather' on metagenomes against Genbank
* download_genbank_genomes - download all matching Genbank genomes
* map_reads - map all metagenome reads to Genbank genomes
* abundtrim_reads - do variable-coverage trimming on data sets; see docs
* make_sgc_conf - make a spacegraphcats config file
Please see https://github.com/dib-lab/genome-grist for quickstart docs.
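
For anyone with reads already staged under the old layout, the rename in this commit amounts to a path mapping from `abundtrim/{sample}.abundtrim.fq.gz` to `trim/{sample}.trim.fq.gz`. The following is a minimal Python sketch of that mapping, not code from genome-grist: `new_reads_path` is a hypothetical helper name, and the sketch only computes the new path without moving any files. It deliberately does not cover signatures, since the new `sigs/{sample}.trim.sig.zip` files are a different format from the old `.sig` files and a rename alone would not be enough.

```python
from pathlib import PurePosixPath

OLD_SUFFIX = ".abundtrim.fq.gz"   # old reads filename ending
NEW_SUFFIX = ".trim.fq.gz"        # new reads filename ending


def new_reads_path(old_path: str) -> str:
    """Map an old-style abundtrim reads path to the new trim/ location.

    Hypothetical helper for illustration only; it computes the path and
    leaves the filesystem untouched.
    """
    p = PurePosixPath(old_path)
    if p.parent.name != "abundtrim" or not p.name.endswith(OLD_SUFFIX):
        raise ValueError(f"not an abundtrim reads file: {old_path}")
    # Everything before the old suffix is the sample name.
    sample = p.name[: -len(OLD_SUFFIX)]
    # Swap the abundtrim/ directory for trim/ under the same output directory.
    return str(p.parent.parent / "trim" / f"{sample}{NEW_SUFFIX}")


if __name__ == "__main__":
    print(new_reads_path("outputs.private/abundtrim/podar.abundtrim.fq.gz"))
```

Note that files produced by the old workflow under `abundtrim/` were abundance-trimmed, so renaming them reproduces the old inputs rather than what the new default workflow would generate from `raw/`.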