diff --git a/autometa/validation/benchmark.py b/autometa/validation/benchmark.py
index 41037f499..8908aa3f4 100644
--- a/autometa/validation/benchmark.py
+++ b/autometa/validation/benchmark.py
@@ -21,9 +21,9 @@ along with Autometa. If not, see .
 COPYRIGHT
-Autometa clustering evaluation benchmarking.
+Autometa taxon-profiling, clustering and binning-classification evaluation benchmarking.
-Script to benchmark Autometa clustering results using clustering evaluation metrics.
+Script to benchmark Autometa taxon-profiling, clustering and binning-classification results using clustering and classification evaluation metrics.
 """
diff --git a/docs/source/benchmarking.rst b/docs/source/benchmarking.rst
index c280b0819..f58eab3e6 100644
--- a/docs/source/benchmarking.rst
+++ b/docs/source/benchmarking.rst
@@ -1,113 +1,210 @@
-************
+============
 Benchmarking
-************
+============
+
+.. headers/sections formatting
+   See: https://docs.typo3.org/m/typo3/docs-how-to-document/main/en-us/WritingReST/HeadlinesAndSection.html

 .. note::
-    The most recent benchmarking results are hosted on our `KwanLab/metaBenchmarks `_ Github repository
-    and provide a range of analyses covering multiple stages and parameter sets. These benchmarks are available with their own respective
-    modules so that the community may easily assess how their novel (``taxon-profiling``, ``clustering``, ``binning``, ``refinement``) algorithms
-    perform compared to current state-of-the-art methods. Tools were selected for benchmarking based on their relevance
-    to environmental, single-assembly, reference-free binning pipelines.
+    The most recent Autometa benchmarking results are hosted on our
+    `KwanLab/metaBenchmarks `_ Github repository and provide a range of
+    analyses covering multiple modules and input parameter sets. These benchmarks are available with their own respective
+    modules so that the community may easily assess how Autometa's novel (``taxon-profiling``, ``clustering``,
+    ``binning``, ``refinement``) algorithms perform compared to current state-of-the-art methods. Tools were selected for
+    benchmarking based on their relevance to environmental, single-assembly, reference-free binning pipelines.

 Benchmarking with the ``autometa-benchmark`` module
 ===================================================

+Autometa includes the ``autometa-benchmark`` entrypoint, a script to benchmark Autometa taxon-profiling, clustering
+and binning-classification prediction results using clustering and classification evaluation metrics. To select the
+appropriate benchmarking method, supply the ``--benchmark`` parameter with the respective choice. The three benchmarking
+methods are detailed below.
+
+.. note::
+    If you'd like to follow along with the benchmarking commands, you may download the test datasets using:
+
+    .. code:: bash
+
+        autometa-download-dataset \
+            --community-type simulated \
+            --community-sizes 78Mbp \
+            --file-names reference_assignments.tsv.gz binning.tsv.gz taxonomy.tsv.gz \
+            --dir-path $HOME/Autometa/autometa/datasets/simulated
+
+    This will download three files:
+
+    - ``reference_assignments.tsv.gz``: tab-delimited file containing contigs with their reference genome assignments. ``cols: [contig, reference_genome, taxid, organism_name, ftp_path, length]``
+    - ``binning.tsv.gz``: tab-delimited file containing contigs with Autometa binning predictions. ``cols: [contig, cluster]``
+    - ``taxonomy.tsv.gz``: tab-delimited file containing contigs with Autometa taxon-profiling predictions. ``cols: [contig, kingdom, phylum, class, order, family, genus, species, taxid]``
+
+
+Taxon-profiling
+---------------
+
 Example benchmarking with simulated communities
------------------------------------------------
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. code:: bash
+
+    # Set community size (see above for selection/download of other community types)
+    community_size=78Mbp
+
+    # Inputs
+    ## NOTE: predictions and reference were downloaded using autometa-download-dataset
+    predictions="$HOME/Autometa/autometa/datasets/simulated/${community_size}/taxonomy.tsv.gz" # required columns -> contig, taxid
+    reference="$HOME/Autometa/autometa/datasets/simulated/${community_size}/reference_assignments.tsv.gz"
+    ncbi=$HOME/Autometa/autometa/databases/ncbi
+
+    # Outputs
+    output_wide="${community_size}.taxon_profiling_benchmarks.wide.tsv.gz" # file path
+    output_long="${community_size}.taxon_profiling_benchmarks.long.tsv.gz" # file path
+    reports="${community_size}_taxon_profiling_reports" # directory path
+
+    autometa-benchmark \
+        --benchmark classification \
+        --predictions $predictions \
+        --reference $reference \
+        --ncbi $ncbi \
+        --output-wide $output_wide \
+        --output-long $output_long \
+        --output-classification-reports $reports

-Download all of the simulated communities and their reference assignments
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+.. note::
+    Using ``--benchmark=classification`` requires the path to a directory containing files (nodes.dmp, names.dmp, merged.dmp)
+    from NCBI's taxdump tarball. This should be supplied using the ``--ncbi`` parameter.
+
+Clustering
+----------
+
+Example benchmarking with simulated communities
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

 .. code:: bash

-    community_sizes=(78Mbp 156Mbp 312Mbp 625Mbp 1250Mbp 2500Mbp 5000Mbp 10000Mbp)
-    autometa-download-dataset \
-        --community-type simulated \
-        --community-sizes ${community_sizes[@]} \
-        --file-names reference_assignments.tsv.gz binning.tsv.gz taxonomy.tsv.gz \
-        --dir-path simulated
+    # Set community size (see above for selection/download of other community types)
+    community_size=78Mbp
+
+    # Inputs
+    ## NOTE: predictions and reference were downloaded using autometa-download-dataset
+    predictions="$HOME/Autometa/autometa/datasets/simulated/${community_size}/binning.tsv.gz" # required columns -> contig, cluster
+    reference="$HOME/Autometa/autometa/datasets/simulated/${community_size}/reference_assignments.tsv.gz"
+
+    # Outputs
+    output_wide="${community_size}.clustering_benchmarks.wide.tsv.gz"
+    output_long="${community_size}.clustering_benchmarks.long.tsv.gz"
+
+    autometa-benchmark \
+        --benchmark clustering \
+        --predictions $predictions \
+        --reference $reference \
+        --output-wide $output_wide \
+        --output-long $output_long

-Benchmark all of the simulated communities and their reference assignments
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+Binning
+-------
+
+Example benchmarking with simulated communities
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

 .. code:: bash

-    for community_size in ${community_sizes[@]};do
-        autometa-benchmark \
-            --benchmark clustering \
-            --predictions simulated/${community_size}/binning.tsv.gz \
-            --reference simulated/${community_size}/reference_assignments.tsv.gz \
-            --output-wide ${community_size}.clustering_benchmarks.wide.tsv.gz \
-            --output-long ${community_size}.clustering_benchmarks.long.tsv.gz
-    done
+    # Set community size (see above for selection/download of other community types)
+    community_size=78Mbp
+
+    # Inputs
+    ## NOTE: predictions and reference were downloaded using autometa-download-dataset
+    predictions="$HOME/Autometa/autometa/datasets/simulated/${community_size}/binning.tsv.gz" # required columns -> contig, cluster
+    reference="$HOME/Autometa/autometa/datasets/simulated/${community_size}/reference_assignments.tsv.gz"
+
+    # Outputs
+    output_wide="${community_size}.binning_benchmarks.wide.tsv.gz"
+    output_long="${community_size}.binning_benchmarks.long.tsv.gz"
+
+    autometa-benchmark \
+        --benchmark binning-classification \
+        --predictions $predictions \
+        --reference $reference \
+        --output-wide $output_wide \
+        --output-long $output_long

-Aggregate across simulated communities (when dataset index is unique)
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-.. code:: python
+Autometa Test Datasets
+======================

-    import pandas as pd
-    import glob
-    df = pd.concat([
-        pd.read_csv(fp, sep="\t", index_col="dataset")
-        for fp in glob.glob("*.clustering_benchmarks.long.tsv.gz")
-    ])
-    df.to_csv("benchmarks.tsv", sep='\t', index=True, header=True)
+Descriptions
+------------

-Aggregate across simulated communities (when dataset index is `not` unique)
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+Simulated Communities
+~~~~~~~~~~~~~~~~~~~~~

-.. code:: python
+.. csv-table:: Autometa Simulated Communities
+   :file: simulated_community.csv
+   :header-rows: 1

-    import pandas as pd
-    import os
-    import glob
-    dfs = []
-    for fp in glob.glob("*.clustering_benchmarks.long.tsv.gz"):
-        df = pd.read_csv(fp, sep="\t", index_col="dataset")
-        df.index = df.index.map(lambda fpath: os.path.basename(fpath))
-        dfs.append(df)
-    df = pd.concat(dfs)
-    df.to_csv("benchmarks.tsv", sep='\t', index=True, header=True)
+You can download all the Simulated communities using this `link `__.
+Individual communities can be downloaded using the links in the above table.
+
+For more information on simulated communities,
+check the `README.md `__
+located in the ``simulated_communities`` directory.
+
+Synthetic Communities
+~~~~~~~~~~~~~~~~~~~~~
+
+51 bacterial isolates were assembled into synthetic communities which we've titled ``MIX51``.
+
+The initial synthetic community was prepared using a mixture of fifty-one bacterial isolates.
+The synthetic community's DNA was extracted for sequencing, assembly and binning.
+
+You can download the MIX51 community using this `link `__.

-Downloading Test Datasets
-=========================
+Download
+--------

-Using the built-in ``autometa`` module
---------------------------------------
+Using ``autometa-download-dataset``
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

 Autometa is packaged with a built-in module that allows any user to download any of the available test datasets.
-To use these utilities simply run the command ``autometa-download-dataset``.
+To retrieve these datasets, simply run the ``autometa-download-dataset`` command.

-For example, to download all of the simulated communities reference binning/taxonomy assignments as well as the Autometa
-v2.0 binning/taxonomy predictions:
+For example, to download the reference assignments for a simulated community as well as the most recent Autometa
+binning and taxon-profiling predictions for this community, provide the following parameters:

 .. code:: bash

-    # Note: community is the test dataset that was used for clustering or classification. e.g.
-    # choices: 78Mbp,156Mbp,312Mbp,625Mbp,1250Mbp,2500Mbp,5000Mbp,10000Mbp
-    community_sizes=(78Mbp 156Mbp 312Mbp 625Mbp 1250Mbp 2500Mbp 5000Mbp 10000Mbp)
-
+    # choices for simulated: 78Mbp,156Mbp,312Mbp,625Mbp,1250Mbp,2500Mbp,5000Mbp,10000Mbp
     autometa-download-dataset \
-        --community-type simulated \
-        --community-sizes ${community_sizes[@]} \
-        --file-names reference_assignments.tsv.gz binning.tsv.gz taxonomy.tsv.gz \
-        --dir-path simulated
+        --community-type simulated \
+        --community-sizes 78Mbp \
+        --file-names reference_assignments.tsv.gz binning.tsv.gz taxonomy.tsv.gz \
+        --dir-path simulated
+
+This will download ``reference_assignments.tsv.gz``, ``binning.tsv.gz``, and ``taxonomy.tsv.gz`` to the ``simulated/78Mbp`` directory.

-Using ``gdrive`` via command line
---------------------------------
+
+- ``reference_assignments.tsv.gz``: tab-delimited file containing contigs with their reference genome assignments. ``cols: [contig, reference_genome, taxid, organism_name, ftp_path, length]``
+- ``binning.tsv.gz``: tab-delimited file containing contigs with Autometa binning predictions. ``cols: [contig, cluster]``
+- ``taxonomy.tsv.gz``: tab-delimited file containing contigs with Autometa taxon-profiling predictions. ``cols: [contig, kingdom, phylum, class, order, family, genus, species, taxid]``

-You can download the individual assemblies of different datasests with the help of ``gdown`` using command line.
-If you have installed ``autometa`` using ``conda`` then ``gdown`` should already be installed.
-If not, you can install it using ``conda install -c conda-forge gdown`` or ``pip install gdown``.
+Using ``gdrive``
+~~~~~~~~~~~~~~~~
+
+You can download the individual assemblies of different datasets with the help of ``gdown`` from the command line
+(this is what ``autometa-download-dataset`` uses behind the scenes).
+If you have installed ``autometa`` using
+``conda`` then ``gdown`` should already be installed. If not, you can install it using
+``conda install -c conda-forge gdown`` or ``pip install gdown``.

 Example for the 78Mbp simulated community
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+"""""""""""""""""""""""""""""""""""""""""

 1. Navigate to the 78Mbp community dataset using the `link `_ mentioned above.
-2. Get the file ID by navigating to any of the files and right clicking, then selecting the ``get link`` option. This will have a ``copy link`` button that you should use. The link for the metagenome assembly (ie. ``metagenome.fna.gz``) should look like this : ``https://drive.google.com/file/d/15CB8rmQaHTGy7gWtZedfBJkrwr51bb2y/view?usp=sharing``
+2. Get the file ID by navigating to any of the files and right clicking, then selecting the ``get link`` option.
+   This will have a ``copy link`` button that you should use. The link for the metagenome assembly
+   (i.e. ``metagenome.fna.gz``) should look like this: ``https://drive.google.com/file/d/15CB8rmQaHTGy7gWtZedfBJkrwr51bb2y/view?usp=sharing``
 3. The file ID is within the ``/`` forward slashes between ``file/d/`` and ``/``, e.g:

 .. code:: bash

@@ -133,25 +230,68 @@ Example for the 78Mbp simulated community
 addressing this specific issue which we are keeping a close eye on and will update this documentation when it is merged.

-Autometa Test Datasets
-======================
+Advanced
+========

-Simulated Communities
----------------------
+Data Handling
+-------------

-.. csv-table:: Autometa Simulated Communities
-   :file: simulated_community.csv
-   :header-rows: 1
+Aggregating benchmarking results
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

-You can download all the Simulated communities using this `link `__.
-Individual communities can be downloaded using the links in the above table.
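The file-ID extraction described in step 3 of the ``gdown`` walkthrough above can also be scripted when many files need downloading. A minimal Python sketch, using only the standard library; the helper name ``extract_drive_id`` is our own illustration, not an Autometa or gdown utility:

```python
import re

def extract_drive_id(share_link: str) -> str:
    """Pull the Google Drive file ID out of a 'copy link' URL.

    Assumes the standard https://drive.google.com/file/d/<ID>/view?usp=sharing
    layout described in step 2/3 above.
    """
    match = re.search(r"/file/d/([^/]+)/", share_link)
    if match is None:
        raise ValueError(f"no Google Drive file ID found in: {share_link}")
    return match.group(1)

link = "https://drive.google.com/file/d/15CB8rmQaHTGy7gWtZedfBJkrwr51bb2y/view?usp=sharing"
print(extract_drive_id(link))  # -> 15CB8rmQaHTGy7gWtZedfBJkrwr51bb2y
```

The extracted ID can then be passed to ``gdown`` as shown in the surrounding steps.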
+
+When dataset index is unique
+""""""""""""""""""""""""""""

-For more information on simulated communities,
-check the `README.md `__
-located in the ``simulated_communities`` directory.

-Generating New Simulated Communities
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+.. code:: python
+
+    import pandas as pd
+    import glob
+    df = pd.concat([
+        pd.read_csv(fp, sep="\t", index_col="dataset")
+        for fp in glob.glob("*.clustering_benchmarks.long.tsv.gz")
+    ])
+    df.to_csv("benchmarks.tsv", sep='\t', index=True, header=True)
+
+When dataset index is `not` unique
+""""""""""""""""""""""""""""""""""
+
+.. code:: python
+
+    import pandas as pd
+    import os
+    import glob
+    dfs = []
+    for fp in glob.glob("*.clustering_benchmarks.long.tsv.gz"):
+        df = pd.read_csv(fp, sep="\t", index_col="dataset")
+        df.index = df.index.map(lambda fpath: os.path.basename(fpath))
+        dfs.append(df)
+    df = pd.concat(dfs)
+    df.to_csv("benchmarks.tsv", sep='\t', index=True, header=True)
+
+
+Downloading multiple test datasets at once
+------------------------------------------
+
+To download all of the simulated communities reference binning/taxonomy assignments as well as the Autometa
+v2.0 binning/taxonomy predictions all at once, you can provide multiple arguments to ``--community-sizes``,
+
+e.g. ``--community-sizes 78Mbp 156Mbp 312Mbp 625Mbp 1250Mbp 2500Mbp 5000Mbp 10000Mbp``
+
+An example of this is shown in the bash script below:
+
+.. code:: bash
+
+    # choices: 78Mbp,156Mbp,312Mbp,625Mbp,1250Mbp,2500Mbp,5000Mbp,10000Mbp
+    community_sizes=(78Mbp 156Mbp 312Mbp 625Mbp 1250Mbp 2500Mbp 5000Mbp 10000Mbp)
+
+    autometa-download-dataset \
+        --community-type simulated \
+        --community-sizes ${community_sizes[@]} \
+        --file-names reference_assignments.tsv.gz binning.tsv.gz taxonomy.tsv.gz \
+        --dir-path simulated
+
+Generating new simulated communities
+------------------------------------

 Communities were simulated using `ART `__, a sequencing read simulator,
 with a collection of 3000 bacteria randomly retrieved.
@@ -173,14 +313,4 @@ e.g. ``-l 1250`` would translate to 1250Mbp as the sum of total lengths for all genomes
     # -s : the standard deviation of DNA/RNA fragment size for paired-end simulations.
     # -l : the length of reads to be simulated
     $ coverage = ((250 * reads) / (length * 1000000))
-    $ art_illumina -p -ss HS25 -l 125 -f $coverage -o simulated_reads -m 275 -s 90 -i asm_path
-
-Synthetic Communities
----------------------
-
-51 bacterial isolates were assembled into synthetic communities which we've titled ``MIX51``.
-
-The initial synthetic community was prepared using a mixture of fifty-one bacterial isolates.
-The synthetic community's DNA was extracted for sequencing, assembly and binning.
-
-You can download the MIX51 community using this `link `__.
+    $ art_illumina -p -ss HS25 -l 125 -f $coverage -o simulated_reads -m 275 -s 90 -i asm_path
\ No newline at end of file
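The coverage arithmetic shown above (``coverage = ((250 * reads) / (length * 1000000))``, the value passed to art_illumina's ``-f`` flag) can be sanity-checked with a few lines of Python. The function name, signature, and example numbers below are our own illustration, not part of Autometa:

```python
def fold_coverage(read_pairs: int, total_length_mbp: int, bases_per_pair: int = 250) -> float:
    """Fold coverage following the formula above.

    250 is the number of simulated bases per read pair (2 x 125 bp reads),
    and total_length_mbp is the summed genome length of the community in Mbp
    (converted to bp by multiplying by 1,000,000).
    """
    return (bases_per_pair * read_pairs) / (total_length_mbp * 1_000_000)

# e.g. 312,000 read pairs over a 78 Mbp community gives exactly 1x coverage:
print(fold_coverage(312_000, 78))  # -> 1.0
```

The resulting value would then be supplied as ``-f`` in the ``art_illumina`` invocation shown above.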